# Design Decisions
## Why build from primitives, not LangChain?
I wanted to demonstrate I understand tool dispatch, memory management,
and retrieval orchestration at the implementation level. My provider
abstraction is ~150 lines. When reranking or a second provider is needed,
I know exactly where it plugs in, because I built every layer.
## Why one provider in V1?
The interface supports multiple providers. V1 shipped OpenAI + Mock to
prove the abstraction. V2 added Anthropic (claude-haiku-4-5), confirming
that switching providers is a one-line config change. The orchestrator
and tools are completely unchanged between providers.
## Why one domain (technical docs)?
Clean text produces clean evaluation. Research papers introduce PDF
parsing noise (tables, figures, formulas) that degrades eval quality
without adding signal. The framework handles any text corpus; the
domain is a config swap.
## Why Reciprocal Rank Fusion, not score normalization?
BM25 scores and cosine similarities live on different scales.
Normalizing across scales is brittle: min-max normalization is
sensitive to outliers, z-score requires distribution assumptions.
RRF fuses by rank position `1/(k + rank)`, which is robust,
parameter-light (only k=60), and well-studied. Trade-off: loses
magnitude information, but for top-5 retrieval this rarely matters.
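
The fusion step can be sketched as follows. This is a minimal illustration, assuming each retriever returns an ordered list of document ids; the function name `rrf_fuse` and the sample filenames are hypothetical and not taken from the repo.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse ranked lists of doc ids by summing 1 / (k + rank) per list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical rankings from a lexical and a dense retriever.
bm25_ranking = ["routing.md", "security.md", "testing.md"]
dense_ranking = ["security.md", "middleware.md", "routing.md"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Only rank positions enter the score, so the raw BM25 and cosine magnitudes never need to share a scale.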
## Why ~16 curated docs, not a large corpus?
Small corpus produces predictable retrieval, reproducible benchmarks,
and easy debugging. Golden dataset questions map to specific source
files. A reviewer can verify any result by reading the source. The
framework scales to larger corpora; the choice is about evaluation
quality, not capability.
## Why no reranker in V1?
Feature-flagged in config (`rag.reranker.enabled: false`). V1
benchmarks without reranking establish an honest baseline. V2 adds
cross-encoder reranking and shows the delta.
## Why no delete in the vector store?
FAISS flat index doesn't support efficient deletion. For a small
corpus that changes rarely, rebuild-on-ingest is simpler and
eliminates consistency bugs.
## Why async internals, sync user behavior?
FastAPI and the OpenAI SDK are async-native. Using async for I/O
avoids blocking the event loop. V2 added SSE streaming (`/ask/stream`)
for the final synthesis step; tool calls remain non-streamed since
they complete in ~100ms.
## Why SQLite-backed conversation sessions
V1 was stateless by design: no `conversation_id`, no cross-request
memory. V2 adds optional SQLite-backed sessions: pass `session_id`
on `/ask` to persist and load conversation history. When omitted,
behavior is identical to V1 (stateless). See the entry
"Why SQLite for conversation persistence" later in this document
for the full rationale.
## Why negative evaluation cases?
A system that always answers sounds confident but may hallucinate.
5 out-of-scope questions test whether the system refuses gracefully
when the corpus doesn't contain the answer. Grounded refusal
requires both refusal language AND zero cited sources: an answer
that says "not found" but still cites docs is not a valid refusal.
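
The two-part rule can be expressed as a small predicate. This is a sketch only: the refusal phrases below are hypothetical placeholders, and the real harness may detect refusal language differently.

```python
# Hypothetical refusal phrases; the actual list is not given in this document.
REFUSAL_MARKERS = ("not found", "cannot answer", "no relevant")


def is_grounded_refusal(answer, cited_sources):
    """True only if the answer refuses AND cites no sources."""
    has_refusal_language = any(m in answer.lower() for m in REFUSAL_MARKERS)
    return has_refusal_language and len(cited_sources) == 0
```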
## Why deterministic eval + optional LLM judge?
CI needs free, deterministic tests. Retrieval P@5, keyword hit
rate, citation accuracy, and grounded refusal rate run without
API keys. LLM-judged faithfulness and correctness are manual
enrichment steps, run locally, with results committed to the
benchmark report.
## Why structlog, not LangSmith/Langfuse?
Third-party observability contradicts the "built from primitives"
narrative. structlog provides JSON-structured logs, `/metrics`
exposes latency + cost. If a team uses LangSmith, adding it is
a one-day task.
## Why explicit citation format [source: filename.md]?
The system prompt mandates inline citations so the evaluation
harness can parse them with `\[source:\s*(.+?)\]` and check
against the structured sources list. This catches hallucinated
citations and measures citation accuracy as a metric.
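
A minimal sketch of that check, using the regex quoted above. The function name and the exact definition of accuracy (valid citations divided by total citations) are assumptions for illustration, not the harness's confirmed implementation.

```python
import re

CITATION_RE = re.compile(r"\[source:\s*(.+?)\]")


def citation_accuracy(answer, sources):
    """Fraction of inline citations that appear in the structured sources list.

    Returns None when the answer contains no citations at all.
    """
    cited = [name.strip() for name in CITATION_RE.findall(answer)]
    if not cited:
        return None
    valid = [name for name in cited if name in sources]
    return len(valid) / len(cited)


# Hypothetical answer with one real and one hallucinated citation.
answer = "Use Depends [source: dependencies.md] as shown in [source: made_up.md]."
accuracy = citation_accuracy(answer, {"dependencies.md"})
```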
## Why per-request retrieval settings via kwargs, not singleton mutation?
The orchestrator is a singleton shared across concurrent requests.
Storing `top_k` / `strategy` on `self` causes cross-request state
bleed. Instead, these are passed as local variables through the
tool execution kwargs, so no shared state is mutated.
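
The pattern looks roughly like this. Class and method shapes are simplified stand-ins (assumptions), kept synchronous for brevity; only the idea of threading settings through call arguments comes from the text.

```python
class SearchTool:
    """Stand-in tool: echoes the settings it was called with."""

    def execute(self, query, top_k=5, strategy="hybrid"):
        return {"query": query, "top_k": top_k, "strategy": strategy}


class Orchestrator:
    """Shared singleton: holds only objects that never change per request."""

    def __init__(self, tool):
        self.tool = tool

    def run(self, query, top_k=5, strategy="hybrid"):
        # Settings live in locals for the duration of this call only.
        tool_kwargs = {"top_k": top_k, "strategy": strategy}
        return self.tool.execute(query, **tool_kwargs)


orchestrator = Orchestrator(SearchTool())
first = orchestrator.run("path parameters", top_k=3, strategy="bm25")
second = orchestrator.run("middleware", top_k=10)
```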
## Why a relevance threshold for grounded refusal
V1 never refuses: it always retrieves tangentially related content and
synthesizes an answer. This is a trust failure: users cannot distinguish
"the system found relevant information" from "the system fabricated from
vaguely related chunks." Grounded refusal rate was 0/5.
We add a refusal gate in `SearchTool.execute()` based on the maximum RRF
score across retrieved chunks. If no chunk scores above the threshold, the
tool returns "No relevant documents found", and the LLM then refuses via the
system prompt rather than fabricating from irrelevant content.
**Gate location:** The gate fires in `SearchTool.execute()`, not the
orchestrator. `SearchTool` is where retrieval scores are still available;
they are dropped before results reach the orchestrator. This also keeps
the orchestrator unchanged.
**Threshold value:** `rag.refusal_threshold: 0.02` is a provisional default
pending an empirical sweep across the evaluation set. The sweep will test
values 0.01–0.03 and select the value that maximizes refusal on out-of-scope
queries without degrading in-scope P@5 and R@5. The actual RRF score
distribution will be documented here after tuning.
**Interaction with reranking:** The refusal gate fires on RRF scores BEFORE
reranking. It is a go/no-go decision, not a per-chunk filter. If the gate
passes, the full candidate set proceeds to the reranker. This keeps the
threshold calibration independent of whether reranking is enabled.
**Default disabled:** `refusal_threshold: 0.0` preserves V1 behavior exactly.
The feature is opt-in until the threshold is tuned.
**Alternative considered:** LLM-based relevance judgment ("is this content
relevant to the query?"). Rejected because it adds latency, cost, and a
second point of failure. The score-based approach is deterministic, fast,
and debuggable.
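
A sketch of the go/no-go gate under stated assumptions: `ScoredChunk` and `apply_refusal_gate` are hypothetical names, and the strict "above the threshold" comparison follows the wording of this entry rather than the actual code.

```python
from dataclasses import dataclass

NO_RESULTS_MESSAGE = "No relevant documents found"


@dataclass
class ScoredChunk:
    source: str
    rrf_score: float


def apply_refusal_gate(chunks, threshold):
    """Return all chunks if any scores above threshold, else the refusal message.

    This is a go/no-go decision on the maximum score, not a per-chunk filter.
    """
    max_score = max((c.rrf_score for c in chunks), default=0.0)
    if max_score > threshold:
        return chunks
    return NO_RESULTS_MESSAGE


strong = [ScoredChunk("routing.md", 0.031), ScoredChunk("testing.md", 0.016)]
weak = [ScoredChunk("testing.md", 0.016)]
```

With two fused rankings and k=60, a chunk ranked first in only one list scores 1/61 ≈ 0.0164, while a chunk ranked first in both scores 2/61 ≈ 0.0328, which is one way to read where the provisional 0.02 sits. A threshold of 0.0 lets every non-empty result set through, since RRF scores are always positive.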
## Why cross-encoder reranking improves precision
BM25 retrieves lexically similar but semantically irrelevant chunks.
RRF fusion mitigates this partially, but noisy BM25 results still
dilute the top-5 set. P@5 was 0.70 in V1.
A cross-encoder (`ms-marco-MiniLM-L-6-v2`, ~80MB) scores each
(query, chunk) pair jointly, capturing semantic relevance that
bi-encoder similarity misses. The tradeoff is ~100–200ms extra latency
per query, acceptable given our 4.7s baseline is dominated by LLM
generation, not retrieval.
The reranker is enabled by default. Setting `rag.reranker.enabled: false`
restores V1 behavior exactly. `reranker.top_k` is independent of
`retrieval.top_k`, so the reranker's output count can be tuned without
affecting the RRF candidate pool.
The retriever passes all RRF-fused candidates to the reranker rather
than a computed subset. The reranker's `top_k` handles truncation.
This is simpler and more robust than computing an input size from
per-system candidate counts.
## Why provider retry with exponential backoff
OpenAI returns 429 (rate limit) errors under load. Without retry logic,
a single 429 causes a user-visible failure. We add exponential backoff:
attempt after 1s, 2s, 4s. After 3 retries, raise `ProviderRateLimitError`
so the middleware returns a clear 503.
The retry wraps the raw `openai.RateLimitError`: it must fire BEFORE
the error gets translated to `ProviderRateLimitError`, otherwise retry
logic is dead code. Other errors (400, 401, timeout) fail immediately.
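
The ordering constraint can be sketched as below. `RateLimitError` here is a local stand-in for the SDK's raw error so the example runs without the SDK installed, the helper name is hypothetical, and the code is synchronous for brevity even though the real provider is async.

```python
import time


class RateLimitError(Exception):
    """Stand-in for the SDK's raw rate-limit error (e.g. openai.RateLimitError)."""


class ProviderRateLimitError(Exception):
    """Translated error that the middleware maps to a 503."""


def call_with_backoff(fn, delays=(1.0, 2.0, 4.0), sleep=time.sleep):
    """Call fn; on a raw rate-limit error, wait 1s, 2s, 4s between retries."""
    for delay in delays:
        try:
            return fn()
        except RateLimitError:
            sleep(delay)
    try:
        return fn()  # final retry
    except RateLimitError as exc:
        # Translation happens only after retries are exhausted.
        raise ProviderRateLimitError("rate limited after retries") from exc
```

Any other exception type is not caught, so it propagates on the first attempt.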
## Why in-memory API rate limiting
A public-facing API needs abuse protection. We use a simple in-memory
sliding window limiter: 10 requests/minute per IP. Sufficient for a
demo deployment; a production system would use Redis.
Known limitation: the per-IP dict grows without bound across distinct
IPs. Acceptable for Fly.io with auto-stop (memory resets). If running
continuously under bot traffic, add a periodic sweep or switch to a
TTL-based structure.
Design choices:
- `/health` and `/metrics` exempt: monitoring should never be rate-limited.
- `Retry-After` header: follows HTTP 429 spec, lets clients back off.
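
A minimal sliding-window limiter along these lines, assuming a per-IP deque of timestamps; the class name and the injectable clock are illustrative choices, not the repo's API. It has the same unbounded-dict limitation noted above.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit=10, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps; grows per distinct IP

    def check(self, ip):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            retry_after = math.ceil(self.window_s - (now - hits[0]))
            return False, retry_after
        hits.append(now)
        return True, 0
```

The second element of the tuple is what a 429 response would send in its `Retry-After` header.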
## Why SQLite for conversation persistence
Three options considered:
1. In-memory dict: Lost on restart.
2. SQLite: Zero-dependency, file-based, survives restarts.
3. Redis/PostgreSQL: Adds infrastructure complexity.
SQLite is right for this scale. `session_id` is optional; when omitted,
the system behaves identically to V1 (stateless). This preserves backward
compatibility and keeps benchmark evaluation deterministic.
The route handler manages session state (load history, store Q+A), not
the orchestrator. The orchestrator accepts an optional `history` parameter
but has no knowledge of persistence. This keeps the agent loop testable
without a database.
Note: On HF Spaces, SQLite is ephemeral (no persistent storage on free
tier). For the demo this is acceptable: sessions last until the container
sleeps. Production would use a volume or managed database.
## Why a second provider (Anthropic)
The provider abstraction existed since V1 but only had OpenAI + Mock.
Adding Anthropic proves the abstraction works across fundamentally
different APIs:
- System message: `system=` parameter, not in the messages list
- Tool definitions: `input_schema` instead of `parameters`
- Tool results: `tool_result` content blocks in user messages
- Tool calls: `tool_use` content blocks, not a separate field
- Stop reason: `tool_use` vs `stop`
The implementation is a config swap: `provider.default: anthropic` in
YAML switches the entire system to Claude. The orchestrator, tools,
evaluation harness, and serving layer are completely unchanged.
Same retry/timeout handling as OpenAI. Both providers are tested with
mocked HTTP responses, so no API keys are needed in CI.
## Why ranked_sources separate from deduplicated sources?
The deduplicated `sources` list in `AgentResponse` is for the API
response. The `ranked_sources` list preserves rank order with
duplicates for evaluation metrics. P@5 and R@5 need the raw
retrieval ranking, not the post-processed answer metadata.
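
To illustrate why the two lists differ, here is a sketch with hypothetical helper names. How the real harness counts duplicate hits in P@5 is not stated here, so the counting rule below is an assumption.

```python
def precision_at_k(ranked_sources, relevant, k=5):
    """Share of the top-k ranked slots (duplicates kept) that are relevant."""
    top = ranked_sources[:k]
    return sum(1 for source in top if source in relevant) / k


def dedupe_preserving_order(ranked_sources):
    """Collapse repeats for the API response while keeping first-seen order."""
    return list(dict.fromkeys(ranked_sources))


# Two chunks from a.md rank first and second.
ranked_sources = ["a.md", "a.md", "b.md", "c.md", "d.md"]
sources = dedupe_preserving_order(ranked_sources)
p_at_5 = precision_at_k(ranked_sources, relevant={"a.md", "b.md"})
```

Computing the metric on the deduplicated list would shift every rank after the first repeat, which is why the raw ranking is kept separately.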
## Why vLLM over TGI / llama.cpp
vLLM has the widest model support, best throughput via PagedAttention, and a native
OpenAI-compatible server (`/v1/chat/completions`). TGI is a valid alternative; llama.cpp
targets different use cases (edge/CPU inference). This is a deliberate choice, not
ignorance of alternatives.
## Why Modal for GPU inference
Serverless GPU eliminates idle cost and GPU node management. A10G at ~$1.30/hr costs
~$0.50 per full 27-question benchmark run. The Docker Compose path (`docker-compose.vllm.yml`)
is retained for users who have local GPUs or prefer persistent serving.
## Why split topology (K8s API + Modal GPU)
The API layer (retrieval, orchestration, tool routing) is CPU-bound and benefits from
horizontal scaling via K8s HPA. The LLM inference layer is GPU-bound and benefits from
serverless elasticity: Modal scales to zero when idle, scales up on demand with no node
provisioning. Co-locating both in K8s would require GPU node pools with idle cost,
node autoscaler latency, and NVIDIA device plugin management. This mirrors a common
production pattern.
## Why Helm only, not Kustomize + Helm
Showing two K8s deployment methods for the same app adds complexity without demonstrating
distinct skills. Helm with `values-dev.yaml` / `values-prod.yaml` covers
environment-specific configuration cleanly.
## Why CPU-based HPA, not custom metrics
CPU utilization works without a Prometheus adapter or custom metrics server. A production
improvement would use the Prometheus adapter to scale on p95 latency from the `/metrics`
endpoint; this requires bridging the JSON metrics to Prometheus exposition format.
Documented as a follow-up.
## Why env var fallback in SelfHostedProvider
Follows the same pattern as OpenAIProvider reading `OPENAI_API_KEY`. The YAML config
provides defaults; env vars override at runtime. No config loader changes needed.
## Why lazy tool-call detection, not metadata check
Checking `/v1/models` metadata for tool-calling support is unreliable: model metadata
doesn't consistently report this capability. Instead, the provider sends one tool-calling
request on first `complete()` call with tools and checks if the response contains
`tool_calls`. The result is cached as `self._supports_tool_calling`. Transient failures
(timeout, 5xx) return `None` and retry on the next call rather than permanently
downgrading to prompt-based fallback.
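
The tri-state cache can be sketched like this. `ToolSupportProbe` and its injected `probe` callable are hypothetical; in the real provider the probe is the first `complete()` call with tools, and the set of exceptions treated as transient may differ.

```python
class ToolSupportProbe:
    """Caches tool-calling support: None = unknown, True/False = decided."""

    def __init__(self, probe):
        self._probe = probe  # hypothetical callable returning a response dict
        self._supports_tool_calling = None

    def supports_tools(self):
        if self._supports_tool_calling is None:
            try:
                response = self._probe()
            except (TimeoutError, ConnectionError):
                # Transient failure: stay undecided and retry on the next call.
                return None
            self._supports_tool_calling = bool(response.get("tool_calls"))
        return self._supports_tool_calling
```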
## Why two-tier injection detection, not three
The original design included a middle tier (embedding similarity against known injection examples). Dropped because the existing embedding model (all-MiniLM-L6-v2) is a general-purpose sentence encoder, not specialized for adversarial detection. Cosine similarity can't distinguish semantic similarity from intent similarity: "how do I ignore a field in Pydantic?" clusters near "ignore previous instructions" in that embedding space. The threshold between "ambiguous" and "suspicious" is an untunable hyperparameter with no ground truth.
Two tiers are cleaner: heuristic regex is deterministic (matches or doesn't), DeBERTa classifier is probabilistic (confidence score). No ambiguous handoff between two probabilistic layers. Deployments without GPU get heuristic-only detection, which is documented, not hidden.
## Why regex + optional spaCy for PII, not a cloud API
Three reasons: cost (cloud PII APIs charge per call), latency (adds network round-trip to every retrieved chunk), and data residency (PII leaves the system boundary). Regex covers the PII types with actual legal/compliance risk: SSNs, credit cards, emails, phone numbers, IP addresses.
spaCy NER (PERSON, ORG) is optional because false-positive rates on technical text are unacceptable without domain tuning. "FastAPI" triggers ORG, "Jordan" triggers PERSON. The optional import pattern (`try: import spacy`) degrades gracefully with a logged warning, so nothing crashes if someone sets `use_ner: true` without installing spaCy.
## Why append-only JSONL for audit, not SQLite
One codepath, one format, no config branching. JSONL is append-only by nature: no schema migrations, no transactions, no connection pooling. Log rotation handles size. `jq` provides immediate queryability without building a custom API.
The original design included an optional SQLite backend and a query endpoint (`GET /admin/audit`). Both were dropped: SQLite adds a second storage codepath with no consumer, and the query endpoint would require API key authentication, an inconsistency when `/ask` itself has no auth.
JSONL imports trivially into SQLite/DuckDB if structured queries are needed later. No bridges burned.
## Why HMAC-SHA256 IP hashing in audit logs
HMAC-SHA256 with a server secret hashes client IPs before logging. Plain SHA-256 was considered but rejected: the IPv4 address space (~4.3 billion) is small enough that unsalted hashes are reversible by offline enumeration. HMAC-SHA256 with a secret key makes precomputation infeasible without the key. The key is sourced from an explicit parameter, `AUDIT_HMAC_KEY` env var, or (with a logged warning) a random per-process fallback.
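
A sketch of the key resolution order and the hash itself, using only the standard library. The function names and logger name are assumptions; the `AUDIT_HMAC_KEY` variable and the three-step fallback come from the paragraph above.

```python
import hashlib
import hmac
import logging
import os
import secrets

logger = logging.getLogger("audit")


def resolve_hmac_key(explicit_key=None):
    """Explicit parameter, then AUDIT_HMAC_KEY, then a random per-process key."""
    if explicit_key:
        return explicit_key
    env_key = os.environ.get("AUDIT_HMAC_KEY")
    if env_key:
        return env_key.encode("utf-8")
    logger.warning("AUDIT_HMAC_KEY not set; using a random per-process key")
    return secrets.token_bytes(32)


def hash_ip(ip, key):
    """Keyed hash of a client IP, hex-encoded for the audit log."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

With the random fallback, hashes are not comparable across restarts, which is presumably why it logs a warning.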
## Why three output validators, not four
The original design included a "length/format sanity check" (reject suspiciously short responses or raw JSON in natural-language context). Dropped because the calculator tool returns short numeric answers and the tech docs domain legitimately contains code blocks and JSON examples. Every false positive erodes trust in the validation layer. The three remaining checks (PII leakage, URL hallucination, blocklist) are deterministic with clear pass/fail semantics.
## Why buffer-then-validate for streaming output
The `/ask/stream` endpoint buffers all events from the orchestrator before sending to the client, then validates the assembled answer. This means the client waits for the full answer before receiving any content chunks. The orchestrator emits the final synthesis as a single chunk (tool-use iterations are not streamed), so the buffering adds no perceptible latency. The alternative (streaming chunks immediately and appending a safety marker) leaks unsafe content to any client that stops reading after the `done` event.
## Why no authentication on API endpoints
The HF Spaces demo is public by design: the `curl` examples in the README work without credentials, which is the point. Adding API key authentication would gate access but break the zero-friction demo experience that makes the project evaluable.
The security pipeline protects *content* (injection detection, PII redaction, output validation), not *access*. This is a deliberate scope boundary: application-layer guardrails ensure the system behaves safely regardless of who calls it, rather than assuming trusted callers. Rate limiting (10 RPM per IP) provides basic abuse protection.
A production deployment would add authentication (API keys or OAuth) at the infrastructure layer: reverse proxy, API gateway, or middleware. The security pipeline's `getattr(..., None)` pattern means auth can be layered on without modifying the existing security components.
## Why monitor mode for output validation, not gating?
Output validation runs post-stream as a monitoring layer. The answer
streams to the client, then validation runs and emits its verdict. Gating
(buffer-then-validate) would add 4-5 seconds of dead air while the full
answer generates, which is unacceptable streaming UX for a documentation Q&A bot.
Trade-off: a hallucinated URL or PII fragment could reach the client
before validation catches it. For this use case (FastAPI docs, no real
PII in corpus), the risk is near-zero. The dashboard labels this
"monitored" (not "gated") to be explicit about the posture.
## Why named residual risks and scope limits, not "fully mitigated" verdicts?
The OWASP LLM Top 10 (2025) mapping could have been written as a 10-row table where LLM01 and LLM02 read as "addressed" without qualifiers, which would be shorter and cleaner-looking. Rejected because OWASP's own 2025 text is explicit about what an input guardrail can and cannot do, and writing a verdict that contradicts the source the mapping cites would be compliance theater.
LLM01 Prompt Injection: OWASP 2025 states that RAG and fine-tuning do not fully mitigate prompt injection, and that indirect injection through retrieved content remains a core risk class. "Fully mitigated" is unsupportable for any system retrieving untrusted content into an LLM context window, which is every RAG system including this one. The LLM01 verdict reads "addressed directly with named residual risk"; the residual-risk cell cites OWASP's own "do not fully mitigate" language verbatim.
LLM02 Sensitive Information Disclosure: OWASP 2025's LLM02 mitigations span four concern classes: access controls, training-data handling, user-consent transparency, and proprietary-information governance. This implementation addresses a narrower output-side subset (output validation for PII leakage, secret formats, and URL hallucination), which is not cleanly one of the four concern classes but a narrower scope than any of them. The verdict reads "addressed directly for the applicable scope"; the scope-limit cell enumerates the four concern classes verbatim and names what addressing the broader concerns would require (multi-tenant or authenticated architecture).
The tension the entry resolves is honesty-vs-scannability: a mapping that surfaces named residual risks and scope limits is longer and harder to skim than one with uniform "addressed" verdicts, but the scannable version over-claims relative to the cited source. Honest evaluation is the brand. Every verdict cell in SECURITY.md must survive a reviewer reading OWASP 2025 in a second tab.
See [SECURITY.md § LLM01 Prompt Injection](SECURITY.md#llm01-prompt-injection) and [§ LLM02 Sensitive Information Disclosure](SECURITY.md#llm02-sensitive-information-disclosure) for the verdict cells; this entry covers why the verdict discipline takes the form it does. The LLM01 "do not fully mitigate" phrasing and the LLM02 four-concern-class enumeration are canonical in SECURITY.md; the README tail and landing-page subtitle paraphrase but must preserve the named-residual-risk and scope-limit structure.
## Why additive SSE stage events?
The enhanced `/ask/stream` adds `meta` and `stage` event types alongside
the existing `sources`, `chunk`, and `done` events. Existing consumers
that only handle the three legacy types are unaffected: they simply
ignore events with unknown types. This avoids versioning the endpoint
or breaking the non-streaming `/ask` contract. The `meta` event fires
first (before any stages) so the frontend can display provider/model
info immediately.
## Why vanilla JS for the frontend, not Alpine or React?
The showcase dashboard has ~5 pieces of reactive state (pipeline stages,
retrieval results, security badges, stats, chat messages). The SSE
handler is inherently imperative: receive event, querySelector the
target node, update classList and textContent. Wrapping this in a
reactive framework adds a dependency, interview questions about
"why is there a framework for 5 state variables", and indirection
that fights the imperative SSE pattern. One `state` object + a few
`render()` functions handles it in ~150 lines.
## Phase 1 SSE gate closure: two baselines on record, not one
The Phase 1 acceptance gate for the SSE backend work (meta event,
stage events, iteration-aware metadata threading, route-level
injection/output-validation events) requires re-running
`make evaluate-fast` and confirming numbers match pre-change state
on the pinned `gpt-4o-mini-2024-07-18` snapshot. The re-run was
honored literally rather than substituted with a git-diff
argument, even though the SSE commits did not touch
`scripts/evaluate.py`'s legacy code path. Two reasons: the
re-commitment discipline that kept Fix 1 and Fix 2 honest applies
equally here, and the legacy path and the `--corpus fastapi` path
produce materially different baselines that cannot substitute for
each other.
**Two distinct baselines now exist at the pinned snapshot, and
both are on record**, one per prompt path:
| Baseline file | Invocation | Prompt source | In-scope P@5 | In-scope R@5 | Citation | Mean calls |
|---|---|---|---|---|---|---|
| `results/fastapi_preedit.json` @ `213da36` | `--corpus fastapi` | `format_system_prompt("FastAPI")` | 0.718 | 0.833 | 1.000 | 1.14 |
| `results/fastapi_legacy_baseline_pinned.json` @ this commit | `make evaluate-fast` (no `--corpus`) | `tech_docs.yaml` `task.system_prompt` | 0.655 | 0.849 | 1.000 | 1.45 |
Citation accuracy holds at 1.000 on both paths, both in-scope and
out-of-scope. The retrieval metric deltas (P@5 −0.063, R@5 +0.016,
keyword hit rate +0.045) and behavioral delta (mean tool calls +0.318 in-scope,
+1.00 out-of-scope) trace to the prompt-path divergence
(`scripts/evaluate.py:67` reads `task.system_prompt` in the legacy
branch vs. `format_system_prompt(label)` in the `--corpus` branch),
not to any change in retrieval, reranking, or refusal-gate code.
This divergence is the same one the "evaluation-layer multi-corpus
support lagged the serving-layer refactor" entry documents; the
narrowed serving-migration deferral tracks its eventual migration.
**Why both baselines are retained.** When the serving-migration
deferral lands and `scripts/evaluate.py`'s legacy branch is removed
(everything routes through `--corpus fastapi`), the regression gate
is "post-migration `make evaluate-fast` output matches pre-migration
`--corpus fastapi` output within pre-committed tolerances." That
gate requires the `--corpus fastapi` baseline as the comparison
reference AND the legacy baseline as evidence of the pre-migration
state that is being retired. Retaining both makes the migration
auditable and bounds its regression budget; retaining only one
would force the post-migration run to compare against a baseline
from a different prompt path, guaranteeing the gate fires on
prompt divergence rather than on any actual regression.
**Gate verdict: passed.** No regression vs pre-SSE legacy path
expectations (citation 1.000 holds, refusal gate fires on the same
5 out-of-scope questions, retrieval numbers in sane in-scope
ranges). Phase 1 SSE backend work is closed from the backend side;
the frontend's consumption of iteration-aware stage events is
orthogonal and owned by Week 1 step 7 (showcase UI).
## Why per-corpus refusal thresholds?
FastAPI and Kubernetes have different corpus characteristics. FastAPI
has 16 short, well-structured docs with sparse cross-references;
relevance tends to concentrate in 1-2 chunks per query. Kubernetes
has 30-40 docs with heavy cross-referencing between concepts (Pod →
Deployment → Service → Ingress), which spreads relevance across more
chunks. A single global refusal threshold would either refuse too
aggressively on K8s (no single chunk dominates, so the top score
looks "low") or not aggressively enough on FastAPI (where a
moderate-scoring chunk might be the only hit and should still refuse).
`CorpusConfig` carries `refusal_threshold` as a per-corpus field.
Each threshold gets tuned against its own golden dataset β€” there
is no "fair" shared threshold because BEIR showed these are not
comparable across corpora. Placeholder values ship in default.yaml
and are replaced by tuned values during the per-corpus evaluation
sweep.
## Why corpus and provider toggles compose: corpus_map[corpus][provider]
The simpler design would have been `corpus_map[corpus]` returning a
single orchestrator. It ships in 10 fewer lines. It also silently
breaks the provider toggle in multi-corpus mode: the orchestrator
inside each corpus cell holds one fixed provider, and clicking
"Anthropic" in the dashboard keeps running on OpenAI.
This project's hero-tile metric is the provider comparison (`1.00 API /
0.14 7B self-hosted`). Breaking the mechanism that demonstrates that
metric (on a portfolio demo where a reviewer will open DevTools and
notice) would erode the honest-evaluation brand the whole repo is
built around. The nested `corpus_map[corpus][provider]` structure
keeps both toggles functional. Store, retriever, and search tool are
shared across providers within a corpus (the expensive objects are
held once per corpus); only the orchestrator varies per provider
since it holds the LLM client. Per-corpus × per-provider memory
overhead is an orchestrator struct, not a FAISS index.
RSS is logged per corpus, not per corpus × provider, because the
store is what drives memory. The provider multiplier is negligible
compared to a hybrid index + embedder.
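
A sketch of the nested structure, with `Store` and `Orchestrator` as lightweight stand-ins (assumptions) for the real objects. It shows the property described above: one store per corpus, one orchestrator per corpus and provider.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """Stand-in for the expensive per-corpus objects (index, embedder)."""
    corpus: str


@dataclass
class Orchestrator:
    """Stand-in for the cheap per-provider wrapper around an LLM client."""
    provider: str
    store: Store


def build_corpus_map(corpora, providers):
    corpus_map = {}
    for corpus in corpora:
        store = Store(corpus)  # built once per corpus
        corpus_map[corpus] = {
            provider: Orchestrator(provider, store) for provider in providers
        }
    return corpus_map


corpus_map = build_corpus_map(["fastapi", "k8s"], ["openai", "anthropic"])
selected = corpus_map["k8s"]["anthropic"]
```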
## Why one parameterized system prompt, not per-corpus templates
The template is `"You are a technical documentation assistant for
{corpus_label}..."`. The only corpus-specific element is the label;
prompt content is identical across corpora: same citation format,
same refusal language, same grounding instructions. Having two
separate prompt files would invite drift: someone tweaks the FastAPI
prompt for a specific failure mode and forgets to update the K8s
version, and the demo silently answers differently on the two toggles.
The parameterization is enforced by two tests: (a)
`format_system_prompt("")` raises `ValueError` so an unresolved
`{corpus_label}` can never reach the LLM, and (b) a spy on
`orchestrator.run_stream` asserts FastAPI and K8s requests receive
different prompts with the correct label substituted.
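
A minimal sketch of the guard that test (a) exercises. The template below contains only the opening sentence quoted above; the remainder of the real prompt is elided in this document and is not reproduced here.

```python
# Only the opening sentence is known from this document; the rest is elided.
SYSTEM_PROMPT_TEMPLATE = (
    "You are a technical documentation assistant for {corpus_label}."
)


def format_system_prompt(corpus_label):
    """Substitute the corpus label, refusing empty or whitespace-only values."""
    if not corpus_label or not corpus_label.strip():
        raise ValueError("corpus_label must be a non-empty string")
    return SYSTEM_PROMPT_TEMPLATE.format(corpus_label=corpus_label)


fastapi_prompt = format_system_prompt("FastAPI")
k8s_prompt = format_system_prompt("Kubernetes")
```
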
The wording deliberately differs from the typical "don't hallucinate"
RAG template:
- **"refuse the question explicitly"** matches our refusal-gate
mechanism. "Say so politely" is soft language that models interpret
as "hedge and answer anyway".
- **"do not infer, do not extrapolate, do not draw on general
knowledge"** is the three-verb prohibition. "Do not fabricate" is
empirically easier to slip past because models distinguish
fabrication (making things up) from extrapolation (drawing
conclusions from adjacent but non-authoritative context).
## Why Kubernetes curation targets recruiter-likely questions, not coverage
The K8s corpus targets ~30-40 pages curated around concepts a
technical reviewer would naturally type (Pod, Deployment, Service,
Ingress, ConfigMap, RBAC) plus cross-referencing overview pages that
stress the reranker. Cluster administration deep-dives, tutorials,
and kubectl reference are explicitly excluded: they add noise without
adding reviewer value and hurt retrieval precision when adjacent
content is thin on concept definitions.
`data/k8s_docs/SOURCES.md` is a version-controlled curation artifact.
Each ingested URL has a one-line rationale, a date pulled, and a
license note. This makes the corpus reproducible and documents the
curation reasoning for any reviewer who looks closely.
Trade-off: the corpus is not comprehensive K8s knowledge. A question
about etcd raft internals will be correctly refused. This is not a
bug: the refusal is part of the demo story, and "the system knows
what it doesn't know" is a feature of the grounded-refusal mechanism.
## Why no cross-corpus score comparison (inspired by BEIR)
Inspired by BEIR's heterogeneous-benchmark framing (Thakur et al.,
NeurIPS 2021), which spans 18 datasets across 9 task types, absolute
retrieval scores are not treated as comparable across FastAPI and
K8s corpora: score distributions depend on chunk length, vocabulary
overlap, and corpus density, none of which are held constant across
domains. Only rank-ordering of system configurations within a single
corpus is meaningful. Concrete consequences for this repo:
- Per-corpus evaluation results are reported separately, never
aggregated into a single "combined" number.
- The hero-tile citation accuracy (`1.00 API / 0.14 7B self-hosted`)
stays FastAPI-specific. It is not restated as a cross-corpus average.
- `make evaluate-fast` accepts a `--corpus` flag but has no "combined"
mode. Anyone who wants a cross-corpus number has to run twice and
acknowledge the incomparability in prose.
- The landing page "Key Findings" cards avoid sentences that compare
FastAPI and K8s numbers directly.
The multi-corpus demo is a **surface feature for interactive
exploration**, not a rebenchmark. The benchmark section of the README
remains FastAPI-only and cites 27 questions on 16 docs with specific
chunker settings.
## K8s golden dataset uses CRAG's 8-type taxonomy as the schema
The K8s golden dataset uses CRAG's 8-type taxonomy (Yang et al.,
NeurIPS 2024) **as the schema** for `question_type`, not as a
requirement to cover all 8 types. CRAG's taxonomy: `simple`,
`simple_w_condition`, `set`, `comparison`, `aggregation`,
`multi_hop`, `post_processing_heavy`, `false_premise`. Temporal
dynamism is a separate orthogonal property captured as
`time_sensitive: bool` on the question schema; it is not a CRAG
category.
Target distribution across the 25-question K8s golden set:
- `simple` (5–6): baseline retrieval
- `simple_w_condition` (3–4): nuanced understanding under conditions
- `comparison` (3–4): retrieval across concept pages, reranker stress
- `multi_hop` (5–6): synthesis across 2–4 docs, reranker stress
- `false_premise` (3–4): grounded refusal mechanism
- `set` / `aggregation` / `post_processing_heavy` (0–3): included
only where corpus content naturally supports
`time_sensitive: bool` flags 2–3 questions targeting version-bounded
content (feature state, deprecations, API version migration).
`false_premise` questions come in two flavors (see separate
"False-premise questions come in two flavors" entry): pure refusal
(flavor A) and documented negative (flavor B). The K8s set includes
at least one of each. Flavor A tests the path where retrieval
correctly returns nothing useful; flavor B tests the path where the
corpus contains an explicit negative answer and the agent must
surface it with citation rather than confabulating a positive.
Rationale for using CRAG as schema (not coverage requirement):
`false_premise` and `time_sensitive` stress grounded refusal and
reduce test-set contamination risk; `multi_hop` and `comparison`
stress the reranker because relevance spreads across multiple
chunks. The distribution was chosen to exercise the parts of the
pipeline the benchmark story claims, not to mimic a general-purpose
QA benchmark.
The golden dataset JSON schema (v2, backward-compatible with the
FastAPI flat list) includes:
- `source_chunk_ids: list[str]` for multi-hop partial credit
(answer must cite at least one of the expected chunks)
- `source_snippets: list[str]` for human-readable context during
review
- `question_type: str` (CRAG taxonomy value)
- `is_multi_hop: bool` for filtered reporting
- Dataset-level header with `corpus`, `version`, `snapshot_date`,
and pinned `chunker` parameters so the dataset is reproducible
against a specific K8s docs snapshot
See `docs/plans/2026-04-12-multi-corpus-refactor-design.md` for the
full schema and rationale.
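
As a rough illustration of the v2 question record listed above, here is a minimal sketch using plain dataclasses. The class name `GoldenQuestionSketch`, the `id` and `question` fields, and the helper `cites_expected_chunk` are assumptions made for this example; the authoritative schema is the one in the design doc.

```python
from dataclasses import dataclass, field

# CRAG's eight question types, as listed in this entry.
CRAG_TYPES = frozenset({
    "simple", "simple_w_condition", "set", "comparison",
    "aggregation", "multi_hop", "post_processing_heavy", "false_premise",
})


@dataclass
class GoldenQuestionSketch:
    """Hypothetical v2 record; `id` and `question` are assumed fields."""

    id: str
    question: str
    question_type: str
    source_chunk_ids: list[str] = field(default_factory=list)
    source_snippets: list[str] = field(default_factory=list)
    is_multi_hop: bool = False
    time_sensitive: bool = False  # orthogonal to question_type, not a CRAG type

    def __post_init__(self) -> None:
        # The taxonomy is the schema: any value outside it is rejected.
        if self.question_type not in CRAG_TYPES:
            raise ValueError(f"unknown question_type: {self.question_type!r}")


def cites_expected_chunk(q: GoldenQuestionSketch, cited_ids: list[str]) -> bool:
    """Multi-hop partial credit: at least one expected chunk is cited."""
    return any(cid in q.source_chunk_ids for cid in cited_ids)
```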
## EU AI Act corpus deferred to v1.2
EU AI Act compliance mapping is deferred to v1.2. Rationale: v1
ships two corpora (FastAPI, K8s) to demonstrate the multi-corpus
architecture; EU AI Act as a third corpus would add ingestion and
golden-set work without exercising architecturally new surface.
Scoped as the first v1.2 addition after v1 launch.
## Cold-start contingency: measure first, lazy-load if needed
Loading two corpora at startup costs memory and cold-start time. On
HF Spaces (target deployment), the realistic ceiling is 8-10 GB
resident RAM and ~60 seconds cold-start before the demo feels broken.
**Policy:**
1. Measure HF Spaces cold-start on Day 1 of deployment.
2. If cold-start < 60 s: plan validated, no changes.
3. If cold-start > 60 s: implement a lazy-load path (FastAPI eager,
K8s lazy on first K8s request). Scoped ~2 hours implementation.
This contingency is **not** pre-built. Pre-building a lazy-load path
that may never ship creates dead code that rots, and the test surface
for "lazy loading plus corpus routing plus provider switching" is
non-trivial. The RSS logging in `app.py` (Task 2) emits the exact
numbers needed to make the decision; the decision is documented here
so future-me remembers the threshold and doesn't optimize prematurely
on a hunch.
## Cold-start gate fired: assumption falsified, fix deferred to v1.1 at the right cause
The preceding "Cold-start contingency" entry pre-committed a lazy-load
fix (FastAPI eager, K8s lazy on first request) if the measured cold
start exceeded 60 seconds. Measurement falsified the entry's core
assumption: **corpus loading is not the dominant cold-start cost**.
The committed fix addresses ~1 % of the observed overshoot. Executing
it verbatim would honor the gate's letter but not its intent: theater
dressed as discipline. This entry documents the measurement, the
falsified assumption, and the new contingency pre-committed at the
actual cause.
**Measurement (N=3, 2026-04-15, HF Spaces target deployment):**
| Sample | Cold start | Silent Python init | Visible phase |
|---|---|---|---|
| N=1 | 113 s | ~101 s | ~12 s |
| N=2 | 89 s | ~70 s | ~19 s |
| N=3 | 129 s | ~115 s | ~14 s |
- Median 113 s, mean ~110 s, range 89–129 s (spread ~40 s)
- **Gate fire is unambiguous at both tails.** Even the fastest sample
(89 s) is ~48 % over the 60 s threshold; the slowest (129 s) is
~115 % over. No boundary ambiguity.
- **Sample-size justification.** N=3 is acknowledged as a small sample.
It is adequate here because (a) the gate-fire conclusion is stable
across both tails, (b) the "silent Python init dominates variance"
finding is stable across all three samples (silent phase varies
70 → 115 s across runs; visible phase varies only 12 → 19 s), and
(c) the cost of additional samples (manual HF Space restart + ~2 min
wait + log extraction per sample) exceeds the marginal information
gain once both tails fire the gate and the variance pattern is stable.
N=4 would tighten the confidence interval on the median but does not
change either the gate-fire conclusion or the falsified-assumption
finding.
- **Variance source named.** HF Spaces shared-infrastructure CPU / IO
contention during Python module imports. The silent-init phase
varies 45 s across samples (70 → 115 s); the visible phase is stable
(12–19 s). That is the signature of host-level contention on a
shared physical node, not code-level variability. An
exclusively-owned container would plausibly show a tighter bound.
- **Raw log captures** (preserved so this entry can be cross-checked
against the underlying evidence without re-running the measurement):
`measurements/2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log`.
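
The headline figures in the bullets above can be re-derived from the three samples in the table. The snippet below is only a check on the arithmetic, not part of the measurement tooling; the variable and function names are illustrative.

```python
from statistics import mean, median

THRESHOLD_S = 60                 # pre-committed cold-start gate
cold_starts_s = [113, 89, 129]   # samples N=1..3 from the table


def pct_over(sample_s: float, threshold_s: float = THRESHOLD_S) -> float:
    """Percentage by which a sample exceeds the gate threshold."""
    return 100 * (sample_s - threshold_s) / threshold_s


summary = {
    "median_s": median(cold_starts_s),
    "mean_s": round(mean(cold_starts_s), 1),
    "spread_s": max(cold_starts_s) - min(cold_starts_s),
    "fastest_pct_over": round(pct_over(min(cold_starts_s))),
    "slowest_pct_over": round(pct_over(max(cold_starts_s))),
    # The gate fires only if every sample is over the threshold.
    "gate_fired_on_all": all(s > THRESHOLD_S for s in cold_starts_s),
}
```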
**Where the cost lives.** At the median (113 s):
- **Silent Python init phase, ~90 s (≈ 80 % of total):** interpreter
start, module imports (`torch`, `transformers`, `langchain`, `faiss`,
`fastapi`, `httpx`, the full dependency closure), and initial model
weight loading (`all-MiniLM-L6-v2` embedder, cross-encoder
reranker). Not logged: there is no observability inside the import chain.
- **Visible startup phase, ~15 s (≈ 15 % of total):** injection
classifier init (~10 s, includes the "classifier skipped" warning),
FastAPI corpus load (< 1 s, +0.9 MB RSS), K8s corpus load (< 1 s,
+25.8 MB RSS), reranker warmup (~2 s).
**The K8s corpus load, which the pre-committed fix was designed to
defer, contributes under 1 second of the 113-second median.**
Deferring it saves roughly 1 % of the overshoot. FastAPI corpus load
is the same order of magnitude. Corpus loading is simply not where the
cost lives on this deployment.
**Why we are not executing the pre-committed fix.** The preceding
contingency was written under an empirical assumption about cost
attribution (corpus loading is the dominant cost). Measurement
falsified the assumption. Implementing the fix anyway would be a
mechanical execution of a recipe whose premise has been disproven;
it checks the gate-honoring box while failing to address the cause.
That is structurally identical to relaxing-by-redefinition ("60 s was
too tight"), just in the opposite direction: **relaxing by execution**.
The pre-commitment rule's purpose is to prevent motivated reasoning
about the gate, not to mandate mechanical compliance with a recipe
whose empirical foundation has collapsed.
The honest action is (1) accept the measurement as the v1 baseline,
(2) document the falsified assumption explicitly (this entry),
(3) re-pre-commit a new contingency at the actual dominant cost with
an explicit trigger condition so the decision is not relitigated at
review time, and (4) update the user-facing README surface to reflect
the measured cold-wake number rather than the optimistic pre-deploy
estimate.
**v1.1 contingency (pre-committed):**
> **If HF Spaces traffic produces more than N cold wakes per day**
> (N to be determined from observed usage patterns after launch, **not
> estimated in advance**), defer eager loading of (a) the cross-encoder
> reranker, (b) the sentence-transformers embedder, and (c) the
> injection classifier tier to first-relevant-request.
>
> **Estimated work:** 4–6 hours (lazy-init wrappers + first-request
> caching + integration tests for the warm/cold transition).
>
> **Expected tradeoff:** cold wake ~113 s → ~50–60 s (approaches the
> original 60 s target); **first request after any cold wake incurs
> +8–15 s** additional latency (model weights load synchronously in
> the request path), after which subsequent warm requests return to
> normal ~5 s latency.
>
> **Trigger is usage-justified, not estimate-justified.** Until real
> traffic data justifies the work, there is nothing to optimize: a
> recruiter demo that gets one cold wake per day does not pay for
> 4–6 hours of engineering plus the new first-request-latency failure
> mode. The trigger threshold N is left unnamed deliberately: naming a
> number in advance would invite the same falsification pattern this
> entry is documenting.
**Methodology lesson.** When a pre-committed contingency is written
under an empirical assumption, the contingency only holds if the
assumption survives measurement. If measurement falsifies the
assumption, the correct action is to document the falsification,
accept the observed baseline, and re-pre-commit at the actual cause.
The wrong action is to execute the original recipe anyway, which
trades one form of motivated reasoning (threshold relaxation) for
another (recipe compliance). The underlying discipline, "pre-commit
your gates and honor them," does not mean "mechanically run the
pre-committed fix regardless of what it addresses." It means "honor
the gate's *intent*, which is to prevent motivated reasoning about
pass/fail."
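
For reference, the "lazy-init wrappers + first-request caching" named in the v1.1 contingency could take roughly the following shape. This is a sketch under assumptions, not code from the repo: `LazyResource` and its loader argument are hypothetical names, and the real reranker, embedder, and classifier loaders are not shown.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defers an expensive load until first use, then caches the result.

    The first caller pays the full load cost inside its request; that is
    the first-request latency penalty the contingency describes.
    """

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        # Double-checked locking so concurrent first requests load once.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]
```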
**Post-hoc refinement (2026-04-22): three latency regimes observed.**
Follow-up warm-latency measurements after the initial cold-start
characterization distinguish a middle regime that the README's
"~2 min cold / ~5 s warm" phrasing collapses:
| Regime | Latency | Cause |
|---|---|---|
| Cold-start (container spin-up) | 89–129 s | Python imports, model load (per the table above) |
| Wake-from-idle (first `/ask` after container is up) | ~6.7 s (≈ +2 s over warm) | Residual lazy init on first request |
| Steady-state warm | ~5 s (mean 4.74 s, n=5) | LLM API round-trip + retrieval |
Source: n=1 first-hit-after-cold-wake @ 6.7 s, then n=5 steady-state
warm @ 4.07, 4.69, 4.78, 5.06, 5.11 s (2026-04-22). The README claim
captures regimes 1 and 3; the ~2 s first-hit penalty between them is
a refinement of the warm-cold boundary, not a contradiction. The
v1.1 projection of "+8–15 s first request after any cold wake"
should be read against this ~2 s current first-hit penalty, not against a
zero-penalty starting point, so the true v1.1 UX cost over current
behavior is closer to +6–13 s.
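
The adjustment in the paragraph above is simple arithmetic on the quoted samples. The snippet below reproduces it as a check; the variable names are illustrative.

```python
from statistics import mean

warm_s = [4.07, 4.69, 4.78, 5.06, 5.11]  # steady-state warm samples (n=5)
first_hit_s = 6.7                          # n=1 first hit after cold wake

warm_mean_s = round(mean(warm_s), 2)
first_hit_penalty_s = round(first_hit_s - warm_mean_s, 2)

# v1.1 projection of +8 to +15 s, re-expressed relative to the penalty
# that already exists today.
projected_v11_s = (8, 15)
net_over_current_s = tuple(round(p - first_hit_penalty_s) for p in projected_v11_s)
```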
## False-premise questions come in two flavors
When authoring golden-dataset questions whose premise is wrong, the
question can point at one of two genuinely different failure modes.
Both are valid; they test different pipeline paths and should be
labeled distinctly so the evaluator routes correctly.
**Flavor A: pure refusal.** The premise is not addressed anywhere in
the corpus. Example: "How do I configure Claude API rate limits in
Kubernetes?" K8s has no such concept. Schema: `category: "out_of_scope"`,
`expected_sources: []`, `source_snippets: []`. The evaluator's
`grounded_refusal` metric expects the answer to contain a refusal
phrase ("does not contain", "no information") AND cite zero sources.
Tests the pipeline path where retrieval correctly returns nothing
useful and the agent correctly declines.
**Flavor B: documented negative.** The corpus contains an explicit
negative answer. Example: "How do I configure NetworkPolicy to enforce
mTLS?" The K8s NetworkPolicy docs have a "What you can't do with
network policies" section that explicitly says "Anything TLS related
(use a service mesh or ingress controller for this)". Schema:
`category: "retrieval"`, `question_type: "false_premise"`,
`expected_sources: [<the negative-answer page>]`, `source_snippets:
[<the verbatim negative statement>]`. The evaluator expects the agent
to retrieve the page, find the negative statement, and answer
negatively with a citation. Tests the stricter path where the corpus
genuinely contains the answer and the agent must not hallucinate a
contradictory capability.
**Why both matter for the honest-evaluation brand.** Grounded refusal
is not "refuse when retrieval is weak." It is "answer exactly what the
source says, including when the source says no." Flavor A tests the
first half (refuse when there is nothing to ground on); flavor B tests
the second half (report the documented negative instead of
confabulating a positive). The K8s golden dataset includes at least
one of each. The first K8s pilot (`k8s_pilot_005`, NetworkPolicy
mTLS) is flavor B. Flavor A is reserved for questions targeting
features that genuinely do not exist in the K8s corpus; at least one
such question is required in the full 25-question set.
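
To make the two shapes concrete, here is a sketch of one record per flavor plus the flavor-A check described above. Field values are taken from this entry; the dict layout and the `grounded_refusal` function body are assumptions about how the evaluator might be written, not its actual code.

```python
# Refusal phrases quoted in this entry; the real evaluator may accept more.
REFUSAL_PHRASES = ("does not contain", "no information")

flavor_a = {
    "question": "How do I configure Claude API rate limits in Kubernetes?",
    "category": "out_of_scope",
    "expected_sources": [],
    "source_snippets": [],
}

flavor_b = {
    "question": "How do I configure NetworkPolicy to enforce mTLS?",
    "category": "retrieval",
    "question_type": "false_premise",
    "expected_sources": ["k8s_network_policies.md"],
    "source_snippets": [
        "Anything TLS related (use a service mesh or ingress controller for this)"
    ],
}


def grounded_refusal(answer: str, cited_sources: list[str]) -> bool:
    """Flavor-A check: a refusal phrase is present AND nothing is cited."""
    text = answer.lower()
    return any(p in text for p in REFUSAL_PHRASES) and not cited_sources
```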
## Pilot_005 refusal-gate + agent-behavior measurement
The first K8s pilot run surfaced two distinct flavor-B failure modes
on `k8s_pilot_005` (NetworkPolicy mTLS). Both are empirical, both
have specific numbers, and both are logged in
`results/k8s_pilot_threshold_0.02.json` and
`results/k8s_pilot_threshold_0.015.json`.
**Failure mode 1: threshold calibration (at 0.02).** The
`SearchTool.execute()` refusal gate fired with `max_score=0.01639`,
exactly `1/(60+1)`, the RRF score of a chunk ranked first by only one
of the two fused retrievers.
BM25 hit "NetworkPolicy" at rank 1; the dense encoder contributed
nothing, because "Anything TLS related (use a service mesh or ingress
controller for this)" is a single negative sentence, not a conceptual
topic the page is semantically "about." Hybrid fusion inherited only
the BM25 rank-1 score. At threshold 0.02 (the FastAPI working value),
the gate refused before the agent saw any chunks. Retrieval P@5 and
R@5 both 0.00; answer is a generic refusal.
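
The `1/(60+1)` figure follows directly from the RRF formula. The sketch below shows the fusion and a gate comparison under two assumptions: that the gate compares the maximum fused score against the threshold, and that the chunk IDs are placeholders. It is not the repo's `SearchTool` code.

```python
RRF_K = 60


def rrf_scores(rankings: list[list[str]], k: int = RRF_K) -> dict[str, float]:
    """Fuse ranked ID lists: each list adds 1/(k + rank) per ID it contains."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return fused


def refusal_gate_fires(max_score: float, threshold: float) -> bool:
    """Assumed gate semantics: refuse when the best fused score is too low."""
    return max_score < threshold


# BM25 and the dense retriever agree on nothing: no chunk gets two votes,
# so the best fused score is a lone rank-1 contribution, 1/61.
disjoint = rrf_scores([["bm25_top", "bm25_second"], ["dense_top", "dense_second"]])
lone_rank1 = max(disjoint.values())

# When both retrievers rank the same chunk first, the score doubles to 2/61.
agreed = max(rrf_scores([["shared"], ["shared"]]).values())
```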
**Failure mode 2: agent behavior on documented negative (at 0.015).**
With the threshold dropped just below the measured max score
(`0.015 < 0.01639`), retrieval is perfect: P@5 1.00, R@5 1.00, all
five top chunks from `k8s_network_policies.md`. But the agent still
produces a flavor-A-style refusal: *"The Kubernetes documentation
does not provide specific instructions on configuring a NetworkPolicy
to enforce mutual TLS..."* The "Anything TLS related" sentence is in
the retrieved chunks; the agent simply treats the absence of
positive instructions as grounds for refusal, rather than reading the
explicit negative sentence and citing it as the answer. KHR 0.67: the
`service mesh` and `ingress controller` keywords (the documented
alternatives the page points to) are missing from the answer.
**Implication.** The flavor-B mechanism requires more than threshold
tuning. Fixing the gate is necessary but not sufficient. The system
prompt needs a flavor-B clause (e.g., *"if the documentation
explicitly says a feature does not exist or is not supported, report
that with citation; do not treat it as unanswerable"*), **or** the
K8s golden dataset's flavor-B questions must use phrasing the
current prompt can route correctly. The 0.30 placeholder value from
the design doc was based on "prefer conservative" intuition without
empirical grounding: the measured working range for K8s pilot
retrieval is lower by more than an order of magnitude than that
intuition, and even at the working threshold the prompt layer is the
blocker.
**What this measurement is.** A pilot smoke-test result, not a
benchmark claim. Aggregates at 0.02: P@5 0.63, R@5 0.83, KHR 0.69.
Aggregates at 0.015: P@5 0.80, R@5 1.00, KHR 0.75. Five of six pilots
produce substantively correct answers on K8s content under the
working threshold, which is evidence the retrieval stack generalizes to K8s.
The pilot's job was schema validation + calibration evidence, not
launch metrics. Launch metrics come from the 25-question K8s golden
set with tuned threshold and (likely) a revised system prompt,
sequenced after this pilot.
## Evaluation-layer multi-corpus support lagged the serving-layer refactor
The Tasks 1–8 multi-corpus refactor wired corpora through
`app.state.corpus_map` and the `/ask` serving route. `scripts/evaluate.py`
was not touched and remained single-corpus: it read
`config.rag.store_path` and `config.evaluation.golden_dataset`
directly, with no awareness of the `corpora` dict. This was an
accurate scoping of the refactor (serving-layer, not eval-layer) but
the gap was not surfaced in the original task list.
The K8s pilot commit adds `--corpus <name>` to `scripts/evaluate.py`,
routing through `config.corpora[name]` for `store_path`,
`refusal_threshold`, and a new optional `golden_dataset` field on
`CorpusConfig`. Without `--corpus`, the legacy single-store path is
preserved for backward compatibility with `make evaluate-fast` and
any existing invocations.
`CorpusConfig.golden_dataset` is `str | None = None` (optional
rather than required) because two legitimate states exist: corpus
has a golden dataset (FastAPI, K8s post-authoring), and corpus has no
golden dataset yet (any corpus during bring-up). The CLI errors
cleanly with *"corpus '<name>' has no golden_dataset configured"*
when the field is None, rather than requiring all corpora to ship
with datasets.
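
A minimal sketch of the optional field and the clean CLI error, assuming a dataclass-based config and `argparse`. `CorpusConfigSketch`, `resolve_golden_dataset`, `select_dataset`, the `bringup` corpus, and the store paths are hypothetical; only the field names, the `--corpus` flag, and the error text come from this entry.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorpusConfigSketch:
    store_path: str
    refusal_threshold: float
    golden_dataset: Optional[str] = None  # `str | None = None` in the entry


def resolve_golden_dataset(corpora: dict[str, CorpusConfigSketch], name: str) -> str:
    """Return the corpus's golden dataset path or exit with a clear message."""
    corpus = corpora[name]
    if corpus.golden_dataset is None:
        raise SystemExit(f"corpus '{name}' has no golden_dataset configured")
    return corpus.golden_dataset


def select_dataset(argv: list[str], corpora: dict[str, CorpusConfigSketch],
                   legacy_dataset: str) -> str:
    parser = argparse.ArgumentParser()
    parser.add_argument("--corpus", default=None)
    args = parser.parse_args(argv)
    if args.corpus is None:
        return legacy_dataset  # legacy single-store path stays untouched
    return resolve_golden_dataset(corpora, args.corpus)


corpora = {
    "fastapi": CorpusConfigSketch(
        store_path="stores/fastapi",  # hypothetical path
        refusal_threshold=0.02,
        golden_dataset="agent_bench/evaluation/datasets/tech_docs_golden.json",
    ),
    # A corpus still in bring-up: valid config, no golden dataset yet.
    "bringup": CorpusConfigSketch(store_path="stores/bringup", refusal_threshold=0.02),
}
```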
## Deferred: path-preserving ingestion
`scripts/ingest.py` uses `doc_path.glob("*.md")` (non-recursive) and
stores the bare filename as the chunk's `source` field. This forces
a flat-namespace convention: FastAPI ships as `fastapi_*.md`, K8s
ships as `k8s_*.md`, and golden dataset `expected_sources` are
filename stems. The path-preserving alternative (recursive `rglob`
plus relative-path source IDs, e.g., `concepts/workloads/pods`) was
evaluated during the K8s pilot planning and explicitly deferred. The
root-cause refactor would have required FastAPI re-ingestion and a
rewrite of the FastAPI golden dataset's `expected_sources`, trading
certain regression risk on a green baseline (288 tests, citation
accuracy 1.00 on API providers) for speculative legibility benefit
on K8s authoring.
The `source_pages` field on `GoldenQuestion` preserves the
human-readable path anchor separately from the machine identifier,
so the deferral does not lose information. Authors see both
`expected_sources: ["k8s_pods.md"]` (what the evaluator matches on)
and `source_pages: ["concepts/workloads/pods"]` (where the content
came from on kubernetes.io) in the same question record.
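
The difference between the current flat convention and the deferred path-preserving one can be sketched with `pathlib`. The two function names are illustrative and this is not the body of `scripts/ingest.py`; it only contrasts `glob` with `rglob` and the resulting source IDs.

```python
from pathlib import Path


def flat_source_ids(doc_path: Path) -> list[str]:
    """Current behavior as described: non-recursive glob, bare filename."""
    return sorted(p.name for p in doc_path.glob("*.md"))


def path_preserving_source_ids(doc_path: Path) -> list[str]:
    """Deferred alternative: recursive rglob, relative path without suffix."""
    return sorted(
        p.relative_to(doc_path).with_suffix("").as_posix()
        for p in doc_path.rglob("*.md")
    )
```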
**Pattern marker, not a promise.** This is the second visa-timeline
deferral of a root-cause refactor in favor of a minimal-blast-radius
fix; the first was the Mar 25 → Apr 12 P@5 slide bisection. Both
deferrals were deliberate, not oversights. Not scheduled until
post-launch; marker only. Post-launch scope: modify `ingest.py` to
`rglob` + relative-path source IDs, re-ingest FastAPI, rewrite both
golden datasets' `expected_sources` to path-style. Estimated 3h.
## K8s refusal_threshold empirical calibration: 0.02 → 0.015
**Change.** `configs/default.yaml`, `corpora.k8s.refusal_threshold`:
`0.02` → `0.015`. Single-line config change, pilot-corpus only.
FastAPI threshold unchanged.
**Empirical evidence.** Diagnostic instrumentation of `k8s_pilot_005`
(*"How do I configure a Kubernetes NetworkPolicy to enforce mutual
TLS (mTLS) between Pods in the same namespace?"*) captured the
retrieval gate firing at `max_score = 0.01639344262295082`, exactly
`1 / (60 + 1)`, the algebraic floor for a single rank-1 BM25 hit
under RRF with `rrf_k = 60`, dense contribution zero. At
`refusal_threshold = 0.02`, pilot_005 tripped the gate and short-
circuited before retrieval chunks reached the agent. At
`refusal_threshold = 0.015` (one tick below the measured floor), the
gate releases and retrieval proceeds. The 0.015 value is not a
tuning guess; it is the nearest round-number floor below the
observed gate-fire value for the single worst pilot in the set.
**Validation.** `results/k8s_preedit.json` captures the full 6-pilot
run at 0.015. Aggregate: P@5 0.80, R@5 1.00, KHR 0.78, mean
`tool_calls_made` 1.167. All six questions receive retrieval; no
gate-fire short-circuits. pilot_005 still refuses as a separate
downstream issue (see next entry when the counterfactual-query fix
lands); that is not a threshold problem.
**Scope of this commit.** K8s only. FastAPI `refusal_threshold`
(0.02) is not affected and FastAPI baseline is not re-measured.
Launch-intent `0.30` placeholder for K8s remains as a comment
marker; the full threshold sweep against the 25-question golden set
replaces 0.015 with a properly-tuned value in a later commit. 0.015
is the pilot-floor safety value, not the production-target value.
**Why this is a separate commit from the prompt revision.** The
threshold calibration is empirically grounded on its own: it
removes the 0.01639 gate-fire blocker, which is the precondition for
any downstream evaluation of pilot_005's actual agent behavior. The
prompt revision addresses a *different* failure mode surfaced once
the gate releases (agent search strategy is monotone positive-
framing). Two independent changes must not entangle in one commit;
if the prompt revision fails its regression gate and is reverted,
the threshold calibration should stand on its own empirical merit.
Feedback memory `feedback_fix_before_sweep.md` applies recursively:
fix measurement-affecting bugs at every layer before combining
fixes into single experiments.
## Prep for counterfactual-query prompt regression: pin, wire, tolerances
**Three sub-changes bundled as one prep commit, each small and in
service of making the downstream regression measurement valid.**
**1. OpenAI model pin.** `agent_bench/core/provider.py:208` changes
`self.model = "gpt-4o-mini"` → `self.model = "gpt-4o-mini-2024-07-18"`.
The unpinned alias is a known drift vector: the Mar 25 → Apr 12 P@5
slide bisection is an already-open parallel track item traceable to
silent alias migration. A regression run that uses the alias across
pre-edit and post-edit phases conflates prompt-clause effect with
model drift, even within a single session if the alias happens to
roll between runs. Pinning the dated snapshot removes the variable.
Pricing dict in `configs/default.yaml` gets a matching
`gpt-4o-mini-2024-07-18` entry so the cost-lookup at
`provider.py:209` still resolves. Tests that pin the model string
live in mock response payloads (not outgoing assertions) and the
langchain baseline (separate code path); neither is affected.
**2. FastAPI multi-corpus eval wiring.** `configs/default.yaml`
adds `corpora.fastapi.golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json`.
The production serving path at `routes.py:105-120 _resolve_system_prompt`
already routes `/ask` and `/ask/stream` through `format_system_prompt(label)`
from `core/prompts.py`; the `app.state.system_prompt` legacy fallback
(`serving/app.py:276`) is effectively dead code given the shipped multi-corpus
config. The **only** remaining caller of `task.system_prompt` is the
`scripts/evaluate.py` legacy branch used by `make evaluate-fast`. Adding
the missing `golden_dataset` field makes `--corpus fastapi` work so the
regression gate can measure the actual production prompt path, not the
legacy eval-scaffolding prompt. Purely additive; zero blast radius on
serving (serving doesn't read `golden_dataset`).
**3. Pre-committed four-metric tolerances.** Written down now, before
the post-edit runs, so the pass/fail call on the counterfactual-query
prompt clause is not a judgment under confirmation-bias pressure.
Applied identically to FastAPI and K8s:
| Metric | Pass criterion |
|---|---|
| P@5 | post-edit ≥ pre-edit − 0.02 |
| R@5 | post-edit ≥ pre-edit − 0.02 |
| Citation accuracy | post-edit ≥ pre-edit (**hard gate**: any drop blocks commit) |
| Mean `tool_calls_made` | post-edit ≤ pre-edit + 0.30 |
| Individual question cap | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit |
**pilot_005 strict flip criterion (K8s-only):**
- `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]`
- Answer cites `k8s_network_policies.md`
- Answer contains "service mesh" OR "ingress controller" (the concrete documented-negative evidence the pre-edit refusal lacked)
- Answer does NOT begin with refusal phrasing ("The ... documentation does not provide", "I cannot answer")
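
The four strict-flip conditions above can be expressed as one predicate. The sketch assumes keyword hits are case-insensitive substring matches and approximates "begins with refusal phrasing" with a regex; both are assumptions, and the function names are illustrative rather than the evaluator's real API. Under the substring assumption, the pre-edit refusal quoted earlier scores 4 of 6 keywords, which matches the reported KHR of 0.67 and shows why the KHR floor alone is not enough.

```python
import re

GOLDEN_KEYWORDS = [
    "not", "does not", "NetworkPolicy",
    "service mesh", "TLS", "ingress controller",
]

# Approximates the two refusal openers listed in the criterion.
REFUSAL_OPENING = re.compile(
    r"(the\s[\w\s]{0,40}?documentation does not provide|i cannot answer)",
    re.IGNORECASE,
)


def keyword_hit_rate(answer: str, keywords: list[str]) -> float:
    """Fraction of keywords found in the answer (substring match, assumed)."""
    text = answer.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)


def strict_flip(answer: str, cited_sources: list[str]) -> bool:
    """All four pilot_005 conditions must hold at once."""
    text = answer.lower()
    return (
        keyword_hit_rate(answer, GOLDEN_KEYWORDS) >= 0.60
        and "k8s_network_policies.md" in cited_sources
        and ("service mesh" in text or "ingress controller" in text)
        and not REFUSAL_OPENING.match(answer.strip())
    )
```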
**Baseline reference:** K8s pre-edit numbers from `results/k8s_preedit.json`
at commit `125dac0`: P@5 0.80, R@5 1.00, citation 1.00 (all 6),
mean tool_calls 1.167. FastAPI pre-edit reference established by
`results/fastapi_preedit.json` in the next step of this session,
same pinned ID, same refusal threshold (0.02).
**Rationale for bundling.** All three sub-changes answer "what must
be true before the regression measurement is valid": drift control,
evaluation path, decision criteria. Splitting into three commits
would add noise without adding signal. None of them change the
prompt template itself; the prompt edit is the NEXT commit and is
the sole experimental variable the regression measures.
## Fix 1 (prompt-level counterfactual clause) attempted and reverted
**Outcome.** K8s regression clean on every metric (P@5, R@5, KHR,
citation, mean tool_calls all within tolerance or unchanged); K8s
pilot_005 flipped from refusal to documented-negative-with-citation
as designed (KHR 0.67 → 1.00, answer contains both "service mesh"
and "ingress controller", cites `k8s_network_policies.md`).
**FastAPI regression failed** on the iteration-inflation tolerance:
mean `tool_calls_made` 1.111 → 1.556 (delta +0.444, gate +0.30),
and two retrieval questions (q024, q025) were pushed from 1 pre-edit
tool call to 3 post-edit tool calls (hitting `max_iterations=3`
cap), violating the pre-committed "no new cap-hits from sub-cap
baseline" criterion.
**Correctness metrics on FastAPI all held.** Citation accuracy
stayed at 1.000 / 1.000 across all 27 questions. P@5 delta −0.007,
R@5 delta 0.000, KHR delta +0.006. The failure is purely process
inflation, not output regression. q024 and q025 produce identical
P@5/R@5/KHR/citation numbers pre and post despite the cap-hit: the
orchestrator's "max iterations hit → one final complete() without
tools" path happened to keep answers correct, but that is
observation, not structural protection.
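
Applying the pre-committed tolerances from the prep entry to the FastAPI numbers reported here gives the same split: correctness checks pass, process checks fail. The function below is a sketch of that gate, not the evaluation script; `regression_gate` and its argument names are hypothetical, and the per-question call counts cover only q024 and q025 as reported.

```python
MAX_ITERATIONS = 3


def regression_gate(
    p_at_5_delta: float,
    r_at_5_delta: float,
    citation_pre: float,
    citation_post: float,
    mean_calls_pre: float,
    mean_calls_post: float,
    per_question_calls: dict[str, tuple[int, int]],  # id -> (pre, post)
    calls_tolerance: float = 0.30,
) -> dict[str, bool]:
    """Return pass/fail per criterion; any False blocks the commit."""
    new_cap_hits = [
        qid
        for qid, (pre, post) in per_question_calls.items()
        if pre < MAX_ITERATIONS and post >= MAX_ITERATIONS
    ]
    return {
        "p_at_5": p_at_5_delta >= -0.02,
        "r_at_5": r_at_5_delta >= -0.02,
        "citation": citation_post >= citation_pre,  # hard gate
        "mean_tool_calls": mean_calls_post <= mean_calls_pre + calls_tolerance,
        "no_new_cap_hits": not new_cap_hits,
    }


fastapi_fix1 = regression_gate(
    p_at_5_delta=-0.007,
    r_at_5_delta=0.000,
    citation_pre=1.000,
    citation_post=1.000,
    mean_calls_pre=1.111,
    mean_calls_post=1.556,
    per_question_calls={"q024": (1, 3), "q025": (1, 3)},
)
```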
**Failure mode.** The clause's trigger condition (*"your first
search returned documentation about the subject of the question
without addressing the specific capability or feature the user is
asking about"*) relies on subjective LLM judgment about whether
retrieved content "addresses" a capability. The judgment is fuzzy
on compound multi-topic questions where the first search returns
partial-topic coverage. q024 asks about "Docker + Gunicorn workers
+ health checks + Pydantic Settings"; first search returns Docker
content, LLM reads "documentation about the subject without
addressing the specific capability," fires the follow-up with
negative framing, gets nothing useful, does a third normal search
to cover the remaining topics, hits the cap. Same pattern on q025.
Over-firing on this class of question is an inherent fragility of
prompt-level LLM-judged triggers; a wording refinement might
narrow the misfire rate but cannot eliminate it as long as the
judgment itself is fuzzy.
**q023 vs q024/q025 asymmetry is a useful signal for Fix 2.** q023
is a pre-existing 3-tool-call compound question ("custom error
handling + CORS middleware + structured testing with dependency
overrides"). Under the prompt clause, **q023 was unchanged** (the
clause did not fire on it), while q024 and q025, structurally
similar compound questions, were pushed into 3-tool-call cap-hit.
The difference is not in question structure but in how the LLM
interpreted the first-search return for each. That asymmetry is
the precise reason a deterministic trigger is the right next step:
any Fix 2 / Fix 3 candidate should be unit-testable against
`(pilot_005, q023, q024, q025)`: the right fix must fire on
pilot_005 and behave predictably on all three compound questions
(either fire on all of them or none of them, but not pick them
selectively by LLM whim).
**Gate discipline honored.** The pre-committed FastAPI tolerances
fired for exactly the reason the pre-commitment was designed:
catching process-metric regressions before they ship. Relaxing the
tolerance post hoc would burn the session's strongest discipline
artifact (pre-committed tolerances plus an honored gate) for the marginal
expected value of shipping this approach. The narrow pilot_005 finding does not
evaporate with the revert: chunk 63 (`d0806d5da91d6026`) is real,
the negative-framing retrieval is reproducible, and Fix 2 will
surface the documented negative the same way via a deterministic
path.
**Fix 2 deferred to a later session.** Deterministic query
expansion at the `SearchTool` layer: when a `search_documents`
call returns no chunk containing a direct answer string, issue a
second internal search with negative-framing keywords and merge
results before returning to the orchestrator. Offline-testable,
corpus-agnostic, no LLM judgment required, no iteration-budget
impact (the double-search happens inside a single tool call, not
across iterations). Unit-testable against the
`(pilot_005, q023, q024, q025)` asymmetry as an acceptance fixture.
**Evidence retained.** Four result JSONs in `results/` document the
regression measurement at the pinned `gpt-4o-mini-2024-07-18`
snapshot in this session:
- `fastapi_preedit.json`: 27 questions, HEAD prompt, 0.02 threshold
- `fastapi_postedit.json`: 27 questions, clause prompt, 0.02 threshold (**gate-failing run**)
- `k8s_preedit_pinned.json`: 6 pilots, HEAD prompt, 0.015 threshold
- `k8s_postedit.json`: 6 pilots, clause prompt, 0.015 threshold (**gate-passing run, pilot_005 strict flip confirmed**)
The previously-committed `results/k8s_preedit.json` (from `125dac0`)
is also a valid K8s-pinned measurement at the session-equivalent
snapshot and remains the canonical threshold-commit evidence.
**Held DECISIONS.md drafts stay held.** The counterfactual-query
finding draft (to be updated when Fix 2 lands) and the threshold-
calibration entry already committed at `125dac0` are both correct
in scope. The narrowed serving-migration deferral entry (tied to
any external reference to the counterfactual-query fix) also stays
deferred until Fix 2 lands, since the production/eval-harness
prompt divergence is unchanged by this revert.
## Fix 2 pre-committed regression gate: SearchTool deterministic query expansion
**Pre-committed BEFORE post-edit runs** (same discipline pattern
that caught Fix 1's iteration inflation cleanly).
**Mechanism under test.** `agent_bench/tools/search.py`
`SearchTool.execute` gains a deterministic two-query retrieval
path. When the primary retrieval passes the refusal gate, a
secondary retrieval is issued against an expanded query
(`original_query + " not supported limitations cannot"`), and the
final context returned to the LLM is `primary_top_3 ++
secondary_top_5` deduplicated by `chunk.id`. Both retrievals run
inside a single `SearchTool.execute` call; from the LLM's
perspective, the tool schema, name, parameters, and return shape
are unchanged, and the iteration budget is untouched.
**Why this is architecturally different from Fix 1.** Fix 1 placed
a behavioral clause in the system prompt that told the agent to
issue follow-up searches itself. The trigger was an LLM judgment
("did the first search return content addressing the specific
capability?") and the follow-up was a separate tool call, so it
counted against `max_iterations`. Over-firing on compound questions
inflated iteration counts and pushed q024/q025 to the cap. Fix 2
replaces this with a deterministic trigger (primary passes gate),
a fixed expansion suffix, and a merge that happens entirely inside
one tool call. No LLM judgment; no iteration change; corpus-
agnostic.
**Suffix choice.** `" not supported limitations cannot"`. Keyword-
dense, ungrammatical on purpose: the suffix exists to shift BM25
and embedding mass toward "what you cannot do" / "limitations"
sections, not to read well. The ungrammatical form is also a self-
documenting signal in retrieval logs: anyone reading a query trace
sees the suffix and immediately knows it is a synthetic expansion,
not user input. A one-line comment in `search.py` preserves the
rationale for future readers.
**Merge choice.** `primary_top_3 + secondary_top_5` deduped by
`chunk.id`, producing 5–8 unique chunks per call. Rationale: top-5
primary would make the expansion redundant on high-overlap queries
(defeating the mechanism), while primary-top-3 guarantees the
expansion always contributes to the final context window. Probe
data (`/tmp/probe_fix2_v2.py`, throwaway) confirms this merge
strategy surfaces pilot_005's target chunk
(`d0806d5da91d6026`, chunk_index 63, "Anything TLS related ... use
a service mesh or ingress controller for this") at position 6–8 in
the merged list.
**Opt-in flag, defaulting ON.** `SearchTool` accepts
`negative_framing_expansion: bool = True`. Default is the shipping
configuration because the regression gate must measure the shipping
behavior, not the no-op path. A `False` default would mean the gate
validates an unused parameter, and a subsequent commit flipping the
default would have no regression evidence. Kill switch is preserved
via explicit `False` at construction if a future regression
requires an A/B comparison.
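A minimal sketch of the mechanism, merge rule, and flag described above. `Chunk`, `Retriever`, and `SearchToolSketch` are hypothetical stand-ins, not the code in `agent_bench/tools/search.py`; the gate comparison (best score below threshold) and the top-5 return on the kill-switch path are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Suffix quoted from this entry; everything else here is illustrative.
EXPANSION_SUFFIX = " not supported limitations cannot"


@dataclass(frozen=True)
class Chunk:
    id: str
    source: str
    score: float


# Hypothetical retriever signature: query string in, ranked chunks out.
Retriever = Callable[[str], list[Chunk]]


def merge_dedup(primary: list[Chunk], secondary: list[Chunk]) -> list[Chunk]:
    """primary[:3] + secondary[:5], keeping the first copy of each chunk.id."""
    seen: set[str] = set()
    merged: list[Chunk] = []
    for chunk in primary[:3] + secondary[:5]:
        if chunk.id not in seen:
            seen.add(chunk.id)
            merged.append(chunk)
    return merged


class SearchToolSketch:
    def __init__(
        self,
        retrieve: Retriever,
        refusal_threshold: float,
        negative_framing_expansion: bool = True,
    ) -> None:
        self.retrieve = retrieve
        self.refusal_threshold = refusal_threshold
        self.negative_framing_expansion = negative_framing_expansion

    def execute(self, query: str) -> list[Chunk]:
        primary = self.retrieve(query)
        # Assumed gate semantics: refuse when the best score is below threshold.
        if not primary or max(c.score for c in primary) < self.refusal_threshold:
            return []
        if not self.negative_framing_expansion:
            return primary[:5]  # assumed no-expansion behavior
        # Second retrieval happens inside the same tool call.
        secondary = self.retrieve(query + EXPANSION_SUFFIX)
        return merge_dedup(primary, secondary)
```

Because both retrievals sit inside one `execute` call, a caller counting tool calls sees exactly one, whichever way the flag is set.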
**Baseline reuse.** The Fix 1 session's pre-edit JSONs
(`results/fastapi_preedit.json`, `results/k8s_preedit_pinned.json`,
both committed at `213da36`) were measured under the currently-
committed state of the repo: pinned `gpt-4o-mini-2024-07-18`, K8s
threshold 0.015, FastAPI threshold 0.02, HEAD `prompts.py` with no
clause, HEAD `search.py` with no expansion. The working tree
verification confirms this state is unchanged. These JSONs are
therefore reused as the Fix 2 pre-edit baseline and do not need to
be re-measured. Only post-edit runs are required for the Fix 2
regression (~$0.02 saved).
**Pre-committed tolerances.**
| Metric | Pass criterion |
|---|---|
| P@5 | post-edit ≥ pre-edit − 0.02 |
| R@5 | post-edit ≥ pre-edit − 0.02 |
| Citation accuracy | post-edit ≥ pre-edit (**hard gate**: any drop blocks commit) |
| Mean `tool_calls_made` | post-edit ≤ pre-edit + **0.05** (design-correctness gate, see note) |
| Individual cap-hit | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit |
**Note on the tool_calls gate.** ≤ +0.05 is a *design-correctness*
gate, not a *performance* gate. Fix 2's invariant is that both
retrievals happen inside one `SearchTool.execute` call, so the
LLM's iteration count is unchanged by construction. Any non-trivial
movement in `mean tool_calls_made` indicates the design invariant
is broken, e.g., expansion accidentally exposed as a separate
tool, or the LLM observing two-call behavior and adapting its
strategy. The gate fires on design violation, not on performance
regression. The 0.05 absolute threshold absorbs legitimate run-to-
run variance from non-determinism in the LLM even at temperature
0, without absorbing real iteration-count movement.
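The tolerance table can be read as a small set of comparisons. The sketch below is one way to encode them; the function names and dict keys are hypothetical, and only the thresholds and the K8s baseline row come from this entry.

```python
def regression_gate(pre: dict[str, float], post: dict[str, float]) -> dict[str, bool]:
    """Pass/fail per aggregate criterion from the tolerance table."""
    return {
        "p_at_5": post["p_at_5"] >= pre["p_at_5"] - 0.02,
        "r_at_5": post["r_at_5"] >= pre["r_at_5"] - 0.02,
        # Hard gate: any drop at all fails.
        "citation_accuracy": post["citation_accuracy"] >= pre["citation_accuracy"],
        # Design-correctness gate on the iteration invariant.
        "mean_tool_calls": post["mean_tool_calls"] <= pre["mean_tool_calls"] + 0.05,
    }


def cap_hit_violations(
    pre_calls: dict[str, int],
    post_calls: dict[str, int],
    max_iterations: int = 3,
) -> list[str]:
    """Questions that were under the cap pre-edit but reach it post-edit."""
    return [
        qid
        for qid, before in pre_calls.items()
        if before < max_iterations and post_calls.get(qid, 0) >= max_iterations
    ]


# K8s baseline row from the reference table in this entry.
k8s_pre = {
    "p_at_5": 0.800,
    "r_at_5": 1.000,
    "citation_accuracy": 1.000,
    "mean_tool_calls": 1.167,
}

# Illustrative post-edit value: one extra tool call on one of six pilots (8/6).
one_extra_call = dict(k8s_pre, mean_tool_calls=8 / 6)
verdict = regression_gate(k8s_pre, one_extra_call)
```

On six pilots, one extra tool call on a single question moves the mean by 1/6 ≈ 0.167, well past the 0.05 bound; on the 27 FastAPI questions the same event moves it by 1/27 ≈ 0.037.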
**pilot_005 strict flip criterion (K8s-only, unchanged from Fix 1
gate):**
- `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]`
- Answer cites `k8s_network_policies.md`
- Answer contains "service mesh" OR "ingress controller"
- Answer does NOT begin with refusal phrasing
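As a rough sketch, the four criteria above could be checked as follows. The case-insensitive substring matching and the `REFUSAL_OPENERS` tuple are assumptions for illustration; the real keyword and refusal-phrasing checks are not shown in this entry.

```python
GOLDEN_KEYWORDS = [
    "not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller",
]

# Hypothetical opener list, lowercased for comparison.
REFUSAL_OPENERS = (
    "the kubernetes documentation does not",
    "the documentation does not",
    "i could not find",
)


def keyword_hit_rate(answer: str, keywords: list[str]) -> float:
    """Fraction of keywords present (assumed: case-insensitive substring match)."""
    text = answer.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)


def strict_flip(answer: str) -> dict[str, bool]:
    """One boolean per criterion; the flip passes only if all are True."""
    text = answer.lower()
    return {
        "keyword_hit_rate": keyword_hit_rate(answer, GOLDEN_KEYWORDS) >= 0.60,
        "cites_source": "[source: k8s_network_policies.md]" in answer,
        "names_alternative": "service mesh" in text or "ingress controller" in text,
        "no_refusal_opener": not text.lstrip().startswith(REFUSAL_OPENERS),
    }
```

Returning one flag per criterion, rather than a single boolean, makes it easy to report which specific criterion fired.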
**Baseline reference for the gate.**
| Corpus | Pre-edit source | P@5 | R@5 | Citation | Mean tool_calls |
|---|---|---|---|---|---|
| FastAPI (27) | `results/fastapi_preedit.json` @ `213da36` | 0.585 | 0.679 | 1.000 | 1.111 |
| K8s (6 pilots) | `results/k8s_preedit_pinned.json` @ `213da36` | 0.800 | 1.000 | 1.000 | 1.167 |
**Post-edit filenames (to be produced).**
- `results/fastapi_postedit_fix2.json`
- `results/k8s_postedit_fix2.json`
**If the gate passes:** commit Fix 2 with `search.py` change, unit
tests (including the tool-spec snapshot test), the two post-edit
result JSONs, and this DECISIONS.md entry extended with the
regression outcome.
**If the gate fires:** revert, document the failure mode, surface
the specific criterion that fired. No tolerance relaxation: same
discipline pattern as Fix 1 revert.
## Fix 2 outcome: mechanism works, response-style criterion fired, reverted
**Regression runs produced.** Two post-edit runs on K8s (FastAPI was
not run because the K8s findings gated the decision before API spend
on the broader set):
| Run | Merge rule | File | Purpose |
|---|---|---|---|
| Fix 2 v1 | `primary[:3] + secondary[:5]` | `results/k8s_postedit_fix2.json` | Initial implementation |
| Fix 2 v2 | `primary[:5] + secondary[:5]` | `results/k8s_postedit_fix2_merge_v2.json` | Path A refinement after v1 failed P@5 on a metric-definition mismatch |
**v1 findings.** Aggregate: P@5 0.800 → 0.767 (Δ −0.033, **FAILED**
the pre-committed P@5 tolerance of post-edit ≥ pre-edit − 0.02). The
failure traced to a merge-rule /
metric-semantics interaction: `retrieval_precision_at_k` computes
precision on `retrieved_sources[:5]`, and with `primary[:3] +
secondary[:5]` the first 5 entries were `primary_top_3 +
secondary_top_2`. For pilot_005, `secondary[1]` was
`k8s_pods.md` (chunk_index 40, surfaced because the reranker
matched its "localhost communication" content against the expanded
query). That single off-source chunk in position 5 dropped P@5
from 1.00 to 0.80 for pilot_005 and similarly for pilot_006.
Iteration invariant held (tool_calls 1.167 → 1.167). Citation
accuracy held (1.000 → 1.000). Target chunk
(`d0806d5da91d6026`, "Anything TLS related") reached the LLM
context for pilot_005 at merged position 7.
**Path A refinement (merge v2).** Change `primary[:3] +
secondary[:5]` → `primary[:5] + secondary[:5]`. Rationale:
primary_top_5 is preserved in positions 1–5 by construction, so
P@5 computed on `ranked_sources[:5]` is unchanged from the
no-expansion baseline. Expansion chunks land in positions 6–10.
Target chunk still reaches LLM context (position 9 for pilot_005).
This is an **implementation refinement, not a tolerance
relaxation**: the pre-committed gate thresholds stand; only the
merge rule was adjusted to respect the metric's window semantics.
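The window effect can be reproduced with a toy layout. The sketch below assumes precision is the share of the first five retrieved sources that are expected; the chunk ids other than the target are made up, and the ordering of `secondary` is an illustrative arrangement chosen to be consistent with the positions reported above, not the measured retrieval output.

```python
def precision_at_k(retrieved_sources: list[str], expected: set[str], k: int = 5) -> float:
    """Assumed definition: share of the first k retrieved sources that are expected."""
    window = retrieved_sources[:k]
    return sum(src in expected for src in window) / len(window) if window else 0.0


def merge(primary, secondary, n_primary, n_secondary=5):
    """primary[:n_primary] + secondary[:n_secondary], deduplicated by chunk id."""
    seen, merged = set(), []
    for chunk_id, source in primary[:n_primary] + secondary[:n_secondary]:
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append((chunk_id, source))
    return merged


TARGET = "d0806d5da91d6026"
expected = {"k8s_network_policies.md"}
# (chunk_id, source) pairs; all ids except TARGET are invented for the example.
primary = [(f"np{i}", "k8s_network_policies.md") for i in range(5)]
secondary = [
    ("np5", "k8s_network_policies.md"),
    ("pods40", "k8s_pods.md"),  # the off-source chunk at secondary[1]
    ("np6", "k8s_network_policies.md"),
    (TARGET, "k8s_network_policies.md"),
    ("np7", "k8s_network_policies.md"),
]


def report(n_primary: int) -> tuple[float, int]:
    merged = merge(primary, secondary, n_primary)
    sources = [src for _, src in merged]
    target_position = [cid for cid, _ in merged].index(TARGET) + 1
    return precision_at_k(sources, expected), target_position


print(report(3))  # v1 merge: (0.8, 7)
print(report(5))  # v2 merge: (1.0, 9)
```

With `n_primary=3` the off-source chunk lands inside the five-entry window; with `n_primary=5` it is pushed to position 7 and the window is filled entirely by primary results.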
**v2 findings: perfect metric preservation, but strict-flip fails on response style.**
Aggregate:
| Metric | Pre-edit | Fix 2 v2 | Delta |
|---|---|---|---|
| P@5 | 0.800 | 0.800 | **0.000** |
| R@5 | 1.000 | 1.000 | 0.000 |
| KHR | 0.806 | 0.806 | 0.000 |
| Citation accuracy | 1.000 | 1.000 | 0.000 |
| Mean `tool_calls_made` | 1.167 | 1.167 | **0.000** |
Every aggregate metric **literally unchanged**. Per-question
deltas: zero on every metric, every question. The design
invariant (iteration budget unchanged, tool schema unchanged,
refusal gate behavior unchanged) holds perfectly.
**But pilot_005 strict flip fails on the refusal-phrasing criterion.**
Post-edit answer:
> *"The Kubernetes documentation does not provide specific
> instructions on configuring a NetworkPolicy to enforce mutual TLS
> (mTLS) between Pods in the same namespace. For mTLS, it is
> generally recommended to use a service mesh or other proxy
> solutions, as NetworkPolicy alone does not handle TLS
> configurations directly [source: k8s_network_policies.md]."*
The answer substantively contains the documented negative with
citation. But it opens with *"The Kubernetes documentation does
not provide specific instructions..."*, which is the exact refusal-phrasing
opener the strict-flip criterion was pre-committed to
reject. The criterion exists because the brand is honest
evaluation: an answer that opens apologizing that the
documentation "does not provide specific instructions" reads, to
a technical reviewer, like the system failed to find the answer
and is papering over the gap, even though the facts and citation
are present. The criterion fired as designed.
**Compare to Fix 1 post-edit answer (from `213da36` evidence):**
> *"Kubernetes NetworkPolicy does not support enforcing mutual TLS
> (mTLS) directly. The documentation states that anything TLS
> related should be handled using a service mesh or ingress
> controller, rather than through NetworkPolicy [source: k8s_network_policies.md]."*
Fix 1's answer asserts a fact about **NetworkPolicy** ("does not
support"); Fix 2's answer asserts a fact about **the documentation**
("does not provide instructions"). The first forecloses the
capability; the second leaves open whether the capability exists
somewhere the system didn't see. That distinction is load-bearing
for any grounded-refusal narrative, and it separates a system that
handles documented negatives crisply from one that hedges around
them.
**Diagnosis.** Fix 2's mechanism successfully gets the target chunk
into the LLM's context window, so the retrieval side of the problem
is solved. What Fix 2 **cannot provide** is explicit guidance on
how to phrase the documented negative once the chunk is present.
Fix 1's prompt clause was doing that guidance work; removing the
clause and relying on the LLM's unaided response style produces a
hedging answer because the LLM, seeing both NetworkPolicy-spec
content and a TLS limitation bullet, defaults to contextual
hedging rather than crisp assertion.
**Fix 2 is therefore not an alternative to Fix 1's prompt clause;
it is a prerequisite.** Fix 2 guarantees the chunk reaches
context; a future "Fix 2 + targeted prompt clause" stack could
resolve both the retrieval gap and the response-style gap without
Fix 1's over-firing problem, because the clause would no longer
need to direct the agent to do a follow-up search (Fix 2 handled
that). The over-firing on compound questions that broke Fix 1 was
caused by the agent deciding to do extra search iterations under
LLM judgment; if the expansion already happened deterministically
inside the first tool call, the clause has less work to do and
may not trigger the second-LLM-call pattern at all. **Speculative
and not for this session.** Future work item.
**Gate verdict: failed on pilot_005 strict flip criterion.**
Reverting, same Fix-1 pattern.
**What this commit contains.**
- `agent_bench/tools/search.py` **reverted** to HEAD (no Fix 2
code changes)
- `tests/test_tools.py` retains the `MockChunk.id` hygiene fix
(the real `Chunk` class has `id`; mock should match the real API
for future test authors)
- `tests/test_tools.py` adds `TestSearchToolSpecSnapshot`: a
general-purpose guard that freezes `SearchTool`'s LLM-facing
contract (name, description, parameters). The lesson from Fix 2
is that any future refactor exposing internal SearchTool state
to the LLM would break iteration-budget invariants; the
snapshot test catches that at test time, independent of whether
Fix 2 lands.
- Two regression evidence JSONs: `results/k8s_postedit_fix2.json`
(v1, the P@5 failure) and `results/k8s_postedit_fix2_merge_v2.json`
(v2, the strict-flip failure). Retained as the measurement
trail behind the revert decision.
- This DECISIONS.md entry (pre-committed gate + outcome + revert
narrative).
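For readers unfamiliar with the snapshot-guard pattern, a minimal sketch follows. `SearchToolStub`, its field values, and `tool_spec` are hypothetical; the real `TestSearchToolSpecSnapshot` in `tests/test_tools.py` may be structured differently.

```python
class SearchToolStub:
    """Hypothetical stand-in exposing only the LLM-facing fields."""

    name = "search_documents"
    description = "Search the documentation corpus."
    parameters = {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    }


# Frozen copy of the contract; any drift must be an explicit, reviewed change.
EXPECTED_SPEC = {
    "name": "search_documents",
    "description": "Search the documentation corpus.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def tool_spec(tool) -> dict:
    """Collect the fields the LLM sees into one comparable dict."""
    return {
        "name": tool.name,
        "description": tool.description,
        "parameters": tool.parameters,
    }


def test_tool_spec_snapshot() -> None:
    assert tool_spec(SearchToolStub()) == EXPECTED_SPEC
```

A refactor that adds, say, a second parameter to the schema fails this comparison even if every behavioral test still passes.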
**What this commit does NOT contain.** No changes to
`agent_bench/tools/search.py`, `agent_bench/core/prompts.py`, or
`configs/default.yaml`. Both Fix 1 (prompt clause) and Fix 2
(SearchTool expansion) have been attempted and reverted this
session. Three commits of progress nonetheless: `125dac0`
(threshold calibration, empirical), `5c1f49f` (prep bundle: model
pin + fastapi wire + Fix 1 pre-committed tolerances), `213da36`
(Fix 1 revert narrative). The threshold calibration and model pin
are real, shipped, measurement-grounded infrastructure changes.
The two fix attempts are documented learning that shapes the
future direction.
## `grounded_refusal` metric reads answer text, not retrieved sources (2026-04-14)
**Context.** Week 1 step 5 authoring (25-question K8s golden set). Two
flavor-A out-of-scope questions (`k8s_004` Jaeger sidecar, `k8s_024`
Envoy xDS ADS) surfaced a pre-existing bug in the
`grounded_refusal` metric during the functional check.
**Bug 1: wrong signal.** The metric's docstring said it checks
whether the answer correctly refuses AND cites no sources, but the
implementation was checking `len(response_sources) == 0` where
`response_sources` is the *retrieved*-sources list. Real agents
retrieve candidates on any non-trivial OOS query (the grounded-refusal
gate at tool level only catches the thinnest queries), inspect the
candidates, find nothing relevant, and refuse *in the answer text*
without citing anything. Checking retrieval emptiness flagged those
correct refusals as failures. Fix: inspect the answer text for
`[source: X.md]` citations via regex; drop the `response_sources`
parameter from the signature entirely.
This was a silent false negative on all 5 fastapi out-of-scope
questions (`q008`–`q010`, `q026`–`q027`) which all correctly refuse
but were being marked `grounded_refusal=False`. Aggregate
`refusal_rate` in `report.py` shifts by the resulting 5-question
delta; any historical comparison to pre-fix fastapi numbers needs
to acknowledge this.
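A sketch of the answer-text check described under Bug 1. The regex and both helper names are assumptions (the actual pattern in `metrics.py` is not quoted in this entry); the point is only that the signal comes from citations in the answer, not from whether retrieval returned anything.

```python
import re

# Assumed citation shape: "[source: some_file.md]".
CITATION_RE = re.compile(r"\[source:\s*([^\]\s]+\.md)\]")


def cited_sources(answer: str) -> list[str]:
    """File names cited in the answer text, in order of appearance."""
    return CITATION_RE.findall(answer)


def refuses_without_citing(answer: str, refusal_phrases: tuple[str, ...]) -> bool:
    """True when the answer contains a refusal phrase and cites nothing."""
    text = answer.lower()
    return any(phrase in text for phrase in refusal_phrases) and not cited_sources(answer)


# Retrieval may well have returned candidates for this query; the check ignores them.
answer = "The corpus does not contain information about Jaeger sidecars."
is_grounded_refusal = refuses_without_citing(answer, ("does not contain information",))
```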
**Bug 2: metric coverage gap surfaced during 25-question authoring.**
`grounded_refusal_rate` recognized "does not contain information"
phrasing (in `refusal_phrases` list) but missed "not in the
{corpus_label} documentation" phrasing, which is the exact shape
taught by the system prompt at `core/prompts.py:17-18`. The LLM
produced the canonical form on some questions and the phrase-list
form on others, so whether a correct refusal was counted depended on
which phrasing the LLM happened to choose, and the metric shifted
non-deterministically. Fix: narrow
regex `\bnot in the\b[^.]{0,60}\bdocumentation\b` added alongside
phrase-list matching.
**Rejected alternative.** Substring `"not in the"` would produce
false positives on valid-answer phrasing: "the rate limit is not in
the same scope as the request timeout", "the flag is not in the 1.28
release; it landed in 1.29", "this value is not in the default
range". All of these are legitimate retrieval answers with
conditional or scope-limiting language, not refusals. Honest
evaluation cannot afford a metric that silently counts these as
grounded refusals.
**Tests.** Two unit tests pin both directions:
`test_canonical_refusal_phrasing_recognized` covers the positive
case ("The answer is not in the Kubernetes documentation"), and
`test_not_in_the_is_not_substring_refusal` covers the negative case
("The rate limit is not in the same scope as the request timeout").
The negative test is the load-bearing one: without it, a future
refactor could silently widen the matcher back to substring and pass
all existing tests. The negative test pins design intent.
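The difference between the narrow regex and the rejected substring check can be seen directly on the sentences quoted in this entry. The pattern is copied from the fix description; compiling it without flags (so it is case-sensitive) and the helper name are assumptions.

```python
import re

NOT_IN_DOCS_RE = re.compile(r"\bnot in the\b[^.]{0,60}\bdocumentation\b")


def matches_canonical_refusal(answer: str) -> bool:
    """Narrow matcher: 'not in the ... documentation' within one sentence."""
    return NOT_IN_DOCS_RE.search(answer) is not None


positive = "The answer is not in the Kubernetes documentation"
negatives = [
    "The rate limit is not in the same scope as the request timeout",
    "the flag is not in the 1.28 release; it landed in 1.29",
    "this value is not in the default range",
]

# The rejected substring check fires on every sentence, refusal or not.
substring_hits = ["not in the" in s for s in [positive, *negatives]]
# The narrow regex fires only on the canonical refusal.
regex_hits = [matches_canonical_refusal(s) for s in [positive, *negatives]]
```

Here `substring_hits` is all `True` while `regex_hits` is `True` only for the first sentence, which is the asymmetry the negative unit test pins down.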
**Scope bound.** This is a metric correctness fix, not a threshold
change. The 0.015 refusal-gate threshold (calibrated in `125dac0`
against the 6-question pilot) is unchanged by this commit. Whether
the corrected metric shifts the optimal threshold against the full
25-question set is a question for the threshold-sweep session, not
this authoring session.
## Parallel tracks / deferred items (2026-04-14)
Tracked list of work items that are deferred to parallel sessions.
Each item has a reason for deferral and a rough scope boundary so
the session that picks it up has the context to pre-commit tolerances
and decision criteria before measuring.
1. **`routes.py:552` audit-logger semantics unification.** The
serving layer's audit record field still uses the pre-fix
`grounded_refusal = not bool(sources)` expression, which disagrees
with the evaluation metric's answer-text-based definition. Not
surfaced to the dashboard (audit log only), but external reviewers
who reference audit records for runtime verification would see a
different definition than the benchmark claims. Fix: call
`grounded_refusal(answer, category)` from `metrics.py` directly.
When this lands, the "grounded_refusal metric" DECISIONS.md entry
above should get a one-line addendum noting the unification.
2. **Full 25Q threshold sweep → production-target `refusal_threshold`
for K8s.** The 25Q set exists, the metric is correct. Sweep
against the full set, compare to pilot-floor 0.015, pick the
production-target value, update `configs/default.yaml` placeholder
comment. Pre-commit before measuring: sweep range, decision
criteria, tolerances. Do not entangle with flavor-B response-style
work below; those are independent axes.
3. **Flavor-B response-style class (pilot_005 + k8s_022).** Two
independent reproductions of "LLM refuses when documented negative
is in retrieved context". Retrieval is healthy on both; the gap
is prompting. Future session: Fix 2 (counterfactual-query
expansion in `SearchTool`) stacked with a targeted prompt clause.
Previously speculative in the Fix 2 revert entry, this now addresses
a documented reproducible class: two reproductions, not a one-off.
4. **Serving-migration deferral.** Tied to external references to
the counterfactual-query fix. Unchanged from prior sessions.
5. **`agent-bench` → `refusal-bench` rename (CLOSED 2026-04-14).**
Decision: keep `agent-bench`, reframe via tagline. The original
concern was name collision with AgentBench (Liu et al., ICLR
2024, ~1000 citations). Due-diligence at launch time: the name
is `agent-bench` (hyphenated) vs. `AgentBench` (camelcase),
which are distinct identifiers across GitHub, arXiv, and PyPI.
The two projects target different audiences (LLM-as-agent
capability vs. RAG+refusal benchmark) and any reviewer reaching
the repo via LinkedIn or CV sees the scope in the README within
seconds. Rename cost is substantial (~350 internal references
across ~60 files, two external account renames, one HF Space
URL break with no redirect) for a naming-precision benefit that
isn't supported by the actual scope: the benchmark measures seven
axes (retrieval, grounding, multi-hop, citation accuracy, and
refusal among them), not refusal alone. Tagline reframe captures the
honest-evaluation positioning without the rename cost:
> "A RAG benchmark built from primitives, with honest
> evaluation of retrieval, refusal, and grounded citation."
HF Space rename (`Nomearod/agentbench` → `Nomearod/agent-bench`
for GitHub-name consistency) is a separate, smaller follow-up
deferred approximately one week. Reason: several job
applications submitted the preceding week reference the current
HF URL (`nomearod-agentbench.hf.space`); renaming the Space now
would break those inbound links with no HF-side redirect. The
rename absorbs cleanly once the application wave lands and the
reference window expires. Until then the README, dashboard, and
DECISIONS.md continue to reference the current `agentbench` URL;
launch-adjacent work (Post #1, screenshots, cold-start measure)
uses the current URL and will be updated in a single small
follow-up commit when the rename happens.
6. **OpenAI snapshot drift bisection.** Mar 25 → Apr 12 P@5 slide;
the model pin at `5c1f49f` (`gpt-4o-mini-2024-07-18`) removed
the ongoing drift risk, so any future measurement is apples-to-
apples. The original bisection is still unresolved but cheap at
this point: tractable whenever there is session capacity, low
urgency because the pin protects forward runs.
7. **Fix 2 revert commit SHA missing from the Fix 2 outcome entry.**
The "Fix 2 outcome" DECISIONS.md entry (mechanism works,
response-style criterion fired, reverted) describes the revert
narratively but does not cite the revert commit's SHA
(post-rewrite: `27c2e17`, `docs(eval): Fix 2 SearchTool query
expansion - attempted and reverted`). Add retroactive SHA
reference in the next docs pass. Not urgent; noted so the
narrative-without-SHA pattern does not spread to other entries.
**Lesson going forward:** prefer explicit SHAs over positional
references like "this commit" / "commit above" in DECISIONS.md
entries; positional references do not survive history rewrites
as robustly as SHA references do.
## K8s refusal_threshold sweep against 25-question golden (2026-04-14)
**Override notice.** This sweep ran in the same session as the
25-question authoring + grounded_refusal metric fix (`4454894`),
after I explicitly flagged that the parallel-tracks guidance from
earlier in the session recommended waiting for a fresh session with
pre-commitment discipline. The user issued an explicit override:
"proceed on best-judgment sweep range and criteria", logged here
for audit trail. The pre-commitment frame below was drafted BEFORE
running any sweep value, not after. The decision criteria were
locked before the first data point was observed, not retrofitted.
**Sweep grid.** 4 threshold values: `0.010`, `0.015` (already
measured in `.cache/eval_k8s_full25_postfix.json`, the post-metric-
fix run from `4454894`), `0.020`, `0.025`.
- `0.010`: one tick below current calibration; sanity-check floor.
- `0.015`: current calibration (pilot-floor, one tick below
pilot_005's 0.01639 max_score).
- `0.020`: matches legacy FastAPI threshold and the original
provisional K8s default before the `125dac0` calibration.
- `0.025`: one tick above legacy; exploration of whether aggressive
OOS short-circuiting is worth the correctness risk.
**Decision criteria (pre-committed).**
1. **OOS refusal must hold.** Both `k8s_004` (Jaeger) and `k8s_024`
(Envoy xDS) must retain `grounded_refusal=True` at the chosen
threshold. Whether the gate fires at the tool level or the
LLM refuses after inspecting context doesn't matter, only that
the metric reports True.
2. **Retrieval recall must not degrade.** Each retrieval-category
question's R@5 at the chosen threshold must be ≥ its R@5 at
`0.015` (the post-fix-25Q baseline) with a noise tolerance of at
most ONE question dropping by at most 0.20. Two or more drops,
or any drop > 0.20, disqualifies the value.
3. **Citation accuracy must hold.** All questions' citation_accuracy
must be ≥ 0.95 at the chosen threshold. One question at 0.80 is
noise-tolerated; two or more is a hard stop.
4. **k8s_022 (flavor-B) retrieval must remain at R@5=1.0.** The
gap is prompting-side, not retrieval-side; any threshold that
breaks the already-working retrieval on flavor-B questions is
a regression.
5. **Pick the highest threshold that satisfies 1–4.** Rationale:
a higher threshold short-circuits more OOS queries at the tool
level, saving a retrieval round trip and an LLM call, which is
a real latency and token-cost win when the correctness is held.
6. **Tie-break.** If multiple values all satisfy 1–4, prefer the
value closest to a clean round number (0.020 over 0.018) for
documentation clarity.
7. **Floor.** If no threshold > 0.015 satisfies 1–4, keep 0.015.
No threshold < 0.015 will be chosen regardless, because sub-0.015 is
strictly less protective than the pilot-floor.
**Scope bound.** K8s only; FastAPI's `refusal_threshold: 0.02` is
unchanged. The flavor-B response-style gap (parallel track #3) is
NOT a sweep variable: changing the threshold does not fix LLM
phrasing. That is the stacked Fix 2 + prompt guidance experiment
the parallel-tracks list already defers.
**Measured results.** All four runs use the post-metric-fix pipeline
(grounded_refusal metric from `4454894`), deterministic mode,
`gpt-4o-mini-2024-07-18`, same retriever config.
| threshold | avg R@5 | OOS refusal | gate fired on | broken retrieval |
|---|---|---|---|---|
| 0.010 | 0.957 | 2/2 | none | none |
| 0.015 | 0.957 | 2/2 | none | none |
| 0.020 | 0.870 | 2/2 | k8s_006, k8s_007, k8s_024 | k8s_006, k8s_007 (R@5=0.00) |
| 0.025 | 0.913 | 2/2 | k8s_004, k8s_007, k8s_024 | k8s_007 (R@5=0.00) |
**Structural finding: LLM query variance makes max_scores non-deterministic.**
At 0.020, `k8s_006` (ConfigMap, simple) gate-fired → empty retrieval →
R@5=0.00. At 0.025, `k8s_006` did NOT gate-fire → 5 sources → R@5=1.00.
A question that gate-fires at a lower threshold but not at a higher
one would be impossible if the SearchTool saw the same query in both
runs. It does not: the orchestrator issues LLM-generated queries, so
the same question can produce different top-k max_scores run-to-run.
`k8s_006`'s max_score for the query the LLM chose lives
somewhere around the 0.018–0.025 boundary; which side of any given
threshold it lands on depends on which query the LLM wrote.
This means **any threshold above 0.015 is structurally fragile**, not
merely "failed on this run." Even if a run at 0.018 passed, a future
run could gate-fire on `k8s_006` or `k8s_007` because the query is
non-reproducible. The production threshold needs to sit below all
legitimate simple-question max_scores with enough margin to absorb
LLM query variance.
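A toy illustration of the fragility argument. The gate comparison (fire when the best score is below the threshold) is an assumption, and the two max_scores are hypothetical values picked to reproduce the `k8s_006` pattern, not measurements.

```python
def gate_fires(max_score: float, threshold: float) -> bool:
    """Assumed gate semantics: short-circuit when the best score is below threshold."""
    return max_score < threshold


def is_fragile(threshold: float, plausible_max_scores: list[float]) -> bool:
    """A threshold is fragile if any plausible phrasing of a valid question would fire the gate."""
    return any(gate_fires(score, threshold) for score in plausible_max_scores)


# Hypothetical max_scores for two LLM-written phrasings of the same question.
phrasing_scores = [0.018, 0.026]

fired_in_0_020_run = gate_fires(phrasing_scores[0], 0.020)  # low-scoring phrasing, gate fires
fired_in_0_025_run = gate_fires(phrasing_scores[1], 0.025)  # high-scoring phrasing, gate stays open
```

Under these assumed scores, both 0.020 and 0.025 are fragile for this question, while 0.015 sits below every phrasing.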
**Decision: keep `refusal_threshold: 0.015`.**
- `0.010`: meets all criteria, identical measured metrics to `0.015`
(avg R@5=0.957, OOS refusal 2/2, no citation fails). Not chosen:
lowering strictly weakens the gate's ability to catch low-
confidence retrievals without improving any measured metric.
- `0.015`: chosen. Meets all criteria and is the highest value that
does not degrade retrieval, which is the definition of the
correct refusal-gate threshold. Preserving the gate's signal is
the gate's purpose; `0.015` gives maximum gate strength without
cost, `0.010` gives the same measurable behavior with less gate
signal, so `0.015` dominates.
- `0.020`: breaks TWO retrieval questions (`k8s_006`, `k8s_007`);
disqualified per criterion 2.
- `0.025`: breaks ONE retrieval question in this run (`k8s_007`)
but the non-determinism finding means a future run could break
more. Even ignoring non-determinism, still disqualified per
criterion 2: `k8s_007` dropping to R@5=0.00 exceeds the 0.20
single-question tolerance.
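The decision above can be replayed mechanically. This sketch encodes only criteria 1, 2, 5, and 7; the citation and `k8s_022` checks are omitted. The per-question drops of 1.0 assume the broken questions scored R@5=1.00 at the 0.015 baseline, and `SweepRow`, `passes`, and `choose` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SweepRow:
    threshold: float
    oos_refusals: int  # out of 2
    # Per-question R@5 drop relative to the 0.015 run (assumed baseline of 1.00).
    r5_drops: dict[str, float] = field(default_factory=dict)


def passes(row: SweepRow) -> bool:
    if row.oos_refusals < 2:  # criterion 1: both OOS questions must refuse
        return False
    drops = list(row.r5_drops.values())
    # Criterion 2: at most one question may drop, and by at most 0.20.
    if len(drops) > 1 or any(d > 0.20 for d in drops):
        return False
    return True


def choose(rows: list[SweepRow], floor: float = 0.015) -> float:
    """Highest passing threshold at or above the floor (criteria 5 and 7)."""
    eligible = [r.threshold for r in rows if r.threshold >= floor and passes(r)]
    return max(eligible, default=floor)


rows = [
    SweepRow(0.010, 2),
    SweepRow(0.015, 2),
    SweepRow(0.020, 2, {"k8s_006": 1.0, "k8s_007": 1.0}),
    SweepRow(0.025, 2, {"k8s_007": 1.0}),
]
chosen = choose(rows)  # 0.015
```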
**Corpus characteristic finding.** The 0.020 default inherited from
FastAPI breaks on K8s because K8s retrieval score distributions are
lower for "easy" questions. `k8s_006` ("What is a ConfigMap?") and
`k8s_007` ("What does a Kubernetes Job do?") are both `type: simple`
with clean single-source expected answers β€” exactly the cases where
BM25+embedding scores should be highest. They land at max_scores in
the ~0.018 range, below the FastAPI-calibrated 0.020 default. This
is **not an authoring bug**: both questions retrieve their
`expected_sources` correctly when the gate doesn't fire. It's a
corpus characteristic: K8s documentation has more topic-overlap
across pages than FastAPI, diluting top-k concentration.
The 25-question set exposed this because the 6-question pilot had
no simple questions with low max_scores: the pilot was drawn from
retrieval-stressful areas (comparison, multi-hop, flavor-B). The
25-question authoring deliberately added simple questions to hit
the CRAG distribution target (6 simple, 5–6 target), and those
simple questions revealed the corpus-characteristic floor.
**Config change.** `configs/default.yaml` `corpora.k8s.refusal_threshold`
comment updated to reference this sweep. Value unchanged at `0.015`.
**Not in scope.** (a) Adding retry-with-query-variance to the
SearchTool to reduce max_score variance: separate session, affects
other corpora. (b) Tuning FastAPI's threshold against its golden
set: the FastAPI default was empirically fine on its own 30Q set
and is not a documented regression. (c) Fixing the `k8s_015`
R@5=0.50 value observed across all threshold runs: pre-existing
authoring state from `4454894`, tracked separately if it becomes
a concern on future runs.
**Narrative summary.** Session hypothesis: pilot_005 is a
counterfactual-query-expansion problem. Session evidence: the
hypothesis is correct on retrieval: the target chunk is reachable
via negative-framing queries and Fix 2 surfaces it deterministically
with zero iteration-budget impact. Session evidence also shows the
hypothesis is **incomplete**: retrieval-only fixes cannot close
the response-style gap, because the LLM under unaided prompting
hedges when a documented negative is surrounded by unrelated
topical content. A future session exploring **Fix 2 + targeted
prompt guidance stacked** is the natural next experiment; this
session's pilot-first discipline has been preserved against two
distinct pre-committed gates, both firing for the reasons they
were designed to catch.
## Credential-exposure incident and history rewrite (2026-04-14/15)
**Summary.** During Week 1 work on the
`feat/user-friendly-landing-page-live-dashboard` branch, an
`instruction.txt` file containing plaintext OpenAI and Anthropic
API keys was accidentally committed at pre-rewrite SHA `2b3150f`
(`style: fix ruff lint - import sorting, line length`) and removed
from the working tree in a later commit (pre-rewrite SHA `3a2c5ef`,
`security: remove instruction.txt containing plaintext credentials`).
The removal did not clean git history: the keys remained accessible
via `git show 2b3150f:instruction.txt` in local history.
**Discovery.** The issue was discovered when GitHub push protection
rejected the first push of the branch to the `origin` remote,
flagging the credentials via its secret-scanning system. The branch
had never been pushed to any public remote prior to the rewrite;
the detection fired on the very first push attempt, which is the
correct moment for secret-scanning to act. Honest credit to the
tooling: GitHub's push protection did exactly what it was designed
to do, and the alternative failure mode (silent push of real
credentials to a public repo) did not occur.
**Immediate actions, in order.**
1. **Key rotation.** Rotated both OpenAI and Anthropic keys at the
respective provider dashboards, revoking the exposed values
immediately. Rotation was confirmed before any git operation
ran. The reasoning was that the keys were exposed on the local
disk regardless of whether they ever made it to a public remote,
so the exposure window needed to be closed first.
2. **Unauthorized-use check.** Verified billing/usage dashboards on
both OpenAI and Anthropic for the exposure window (from commit
`2b3150f` landing until rotation). No unauthorized activity
observed on either account.
3. **Local `.env` update and smoke test.** Updated local `.env`
with the new keys. Verified both worked via minimal API calls
that return only HTTP status codes (never the key values
themselves): `GET /v1/models` for OpenAI (200), `POST /v1/messages`
with a 1-token request for Anthropic (200). Total verification
cost: <$0.0001.
4. **Repository backup.** Before running any history-rewriting
command, backed up the entire repository via `rsync -a` to
`/Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>`,
excluding only `.mypy_cache` and `.cache` (both derivative,
regenerable, and explicitly `.gitignore`'d). The backup preserved
`.git/`, all four worktree state files under `.git/worktrees/`,
the `.worktrees/` checkouts themselves, and all tracked source
files. The backup is the safety net if the rewrite had gone
wrong in any way; this session never needed to consult it.
5. **History rewrite via `git filter-repo`.** Ran
`git filter-repo --path instruction.txt --invert-paths --force`
on the main clone. The `--force` flag was required because
filter-repo's default safety check refuses to run on non-fresh
clones; the backup step above mitigates the risk that this flag
is usually guarding against. 186 commits were parsed and
rewritten in ~2.4 seconds; filter-repo's internal repacking
completed in an additional ~5 seconds. The `origin` and `hf`
remotes were automatically unset by filter-repo as its standard
safety behavior (and restored from a saved file before the push).
6. **Dropped empty commit.** Pre-rewrite commit `3a2c5ef` (which
removed `instruction.txt` from the working tree but did not
clean history) became empty after filter-repo stripped the file
from all prior commits and was dropped automatically. This is
correct filter-repo behavior: the commit's only net effect was
to remove a file that no longer exists in any predecessor, so
post-rewrite it has no content change and is elided from the
linear history. The total commit count went from 186 → 185.
Pre-rewrite SHA `3a2c5ef` maps to `00000...00000` in
`.git/filter-repo/commit-map`, indicating the drop. The dropped
SHA was not referenced anywhere in DECISIONS.md, so the drop
had zero audit-trail impact.
7. **Multi-layer verification sweep.** Ran six checks across every
location where the credentials could still be present:
(a) `git log --all --full-history -- instruction.txt` returned
empty; (b) `git rev-list --all --objects | grep instruction.txt`
returned 0 matches; (c) `git reflog --all` was empty after
`git reflog expire --expire=now --all`; (d) `git fsck
--unreachable` returned clean; (e) `git stash list` was empty;
(f) a precise key-value regex scan across all blobs in the
rewritten object database (`sk-[A-Za-z0-9]{30,}`,
`sk-ant-[A-Za-z0-9]{20,}`, and env-var-assignment patterns)
found 23 matches, **all verified to be non-secret content**:
15 historical README.md blobs containing the
documentation placeholder `ANTHROPIC_API_KEY=sk-ant-...`
(with three literal dots), 7 historical `docs/provider_comparison.md`
blobs with the same documentation placeholder pattern, and 1
`tests/test_output_validator.py` blob containing test fixtures
that intentionally use mock key-shaped strings to verify the
output-validator's secret-redaction logic. The precise scan is
a meaningful check: it demonstrates that the exposure was
isolated to `instruction.txt` and did not spread via copy-paste
of the key values into other files before removal.
8. **Worktree walk.** All four worktrees (`feat-infra-sprint`,
`feature-grounded-refusal`, `langchain-baseline`,
`security-hardening`) were checked for `instruction.txt` history
pollution and for uncommitted changes. All four were clean:
no pollution in any branch's history (filter-repo operates on
all refs in a shared `.git/`, so the worktrees were reached
through the main clone's object database) and no local dirty
state in any working tree. No worktree deletion or recreation
was needed.
9. **DECISIONS.md SHA remap.** The filter-repo operation rewrote
every commit's SHA downstream of the first rewritten commit.
This broke every explicit SHA reference in DECISIONS.md because
those references pointed to pre-rewrite SHAs that no longer
exist. The remap used `.git/filter-repo/commit-map` as the
authoritative SHA-based mapping (not message-based pairing,
which would have been vulnerable to duplicate-message
ambiguity; 2 pairs of commits in the pre-rewrite history did
in fact have identical messages, though neither was in the
substitution set). Four unique old SHAs were remapped across
18 substitution sites:
| OLD (pre-rewrite) | NEW (post-rewrite) | Commit role |
|---|---|---|
| `bd2b913` | `213da36` | Fix 1 counterfactual prompt clause revert |
| `b97f00f` | `125dac0` | K8s refusal_threshold 0.02 → 0.015 calibration |
| `77017db` | `5c1f49f` | pin gpt-4o-mini snapshot + wire fastapi golden |
| `526be18` | `4454894` | Week 1 step 5: 25Q golden + grounded_refusal fix |
Every message matched exactly across the old→new pairing; no
new SHA prefix collides with any old SHA prefix; post-remap
grep confirmed zero remaining references to any old SHA.
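The SHA-based remap in step 9 can be sketched in Python. This is a minimal illustration, not the script used in the session: `load_commit_map` and `remap_short_shas` are hypothetical names, and the sketch assumes the `commit-map` file is a header line followed by whitespace-separated `old new` pairs of full 40-character SHAs, with dropped commits mapped to an all-zero SHA. The sample SHAs are synthetic.

```python
import re

ZERO_SHA = "0" * 40


def load_commit_map(text: str) -> dict[str, str]:
    """Parse commit-map text into {old_full_sha: new_full_sha}."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not a pair of 40-hex SHAs.
        if len(parts) == 2 and all(re.fullmatch(r"[0-9a-f]{40}", p) for p in parts):
            mapping[parts[0]] = parts[1]
    return mapping


def remap_short_shas(doc: str, mapping: dict[str, str], width: int = 7) -> str:
    """Replace backticked abbreviated old SHAs in `doc` with abbreviated new SHAs."""
    def substitute(match: re.Match) -> str:
        short = match.group(1)
        hits = [old for old in mapping if old.startswith(short)]
        if len(hits) != 1:
            return match.group(0)  # unknown or ambiguous prefix: leave untouched
        new = mapping[hits[0]]
        if new == ZERO_SHA:
            return match.group(0)  # commit was dropped: nothing to point at
        return f"`{new[:width]}`"

    return re.sub(r"`([0-9a-f]{7,40})`", substitute, doc)


# Synthetic example data (not real SHAs from this repository).
sample_map = "\n".join([
    "old                                      new",
    "a" * 40 + " " + "b" * 40,
    "c" * 40 + " " + ZERO_SHA,
])
mapping = load_commit_map(sample_map)
remapped = remap_short_shas("Fixed in `aaaaaaa`; see also `ccccccc`.", mapping)
print(remapped)
```

Keying the substitution on the SHA itself rather than on the commit message is what makes duplicate messages harmless: two commits with identical messages still have distinct entries in the map.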
**Exposure scope assessment.** The branch had never been pushed
to any public remote prior to the rewrite. The credentials existed
in:
- Local git history at `/Users/zenith/Desktop/agent-bench/.git/` (cleaned)
- Four worktree clones sharing the same `.git/` (cleaned via the main repo)
- The rsync backup at
`/Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>`
(to be deleted after this commit and test suite confirm the
rewrite is correct)
No external exposure via GitHub, HF Spaces, or any other shared
system occurred. No cached CI artifacts contain the keys because
CI only runs on pushed branches and this branch was never pushed.
No forks or clones exist outside the local machine. GitHub's
push-protection detection itself touched the key strings during
the rejected push attempt, but GitHub's secret scanning is trusted
infrastructure and the rejection is the good outcome, not an
additional exposure event.
**Why this entry exists.** Credential hygiene failures are worth
documenting, not hiding. A reviewer who reads this entry sees a
developer who: made a mistake, caught it via automated tooling
working as designed, rotated keys before touching git, rewrote
history surgically with a backup as the safety net, verified the
rewrite across six independent checks, and preserved audit-trail
integrity through the SHA remap. The honest-evaluation brand
extends to credential-handling incidents: the alternative of
pretending this didn't happen, or silently unblocking the secret-
scanning rejection to push exposed values to a public repo, would
be a strictly worse outcome for both security posture and brand
credibility.
**Procedural lessons for DECISIONS.md going forward.** Prefer
explicit commit SHAs over positional references like "this commit"
or "commit above"; positional references do not survive history
rewrites as robustly as explicit SHAs do. The "Fix 2 outcome"
entry above was identified during this incident as missing an
explicit SHA reference to the Fix 2 revert commit (post-rewrite
SHA `27c2e17`); this is tracked as parallel-tracks item #7 for a
retroactive fix in the next docs pass.
### Round 2: Google API key format in a test fixture
After the round-1 rewrite was complete and the feature branch had
been pushed to `origin` for the first time, GitHub secret scanning
raised a second alert (alert #1, `secret_type: google_api_key`)
against `tests/test_output_validator.py` line 152 at pre-round-2
commit `8ebe3964af7d` (`security: fail-closed on secret extraction
and env var leakage`). The alert was on a test fixture inside a
`@pytest.mark.parametrize` list, structurally consistent with the
other fake fixtures in the same list (OpenAI `sk-test123`,
Anthropic `sk-ant-xyz`, AWS `AKIAIOSFODNN7EXAMPLE`). The Google
fixture, however, was 35 chars after the `AIza` prefix and matched
both GitHub's detection pattern and the output validator's own
detection regex exactly.
**Disambiguation.** Asked whether the string was a hand-typed fake
or a real-leaked Google API key, the developer confirmed: (1) yes,
a Google API key had been created at some point in a GCP or
Google AI Studio context unrelated to this project, and (2) no,
the string on line 152 was not recognizably hand-typed. Combined
with the structural inconsistency against the other clearly-fake
fixtures in the same parametrize list, the safe interpretation
was to treat it as potentially real and rotate + rewrite rather
than dismiss as false positive.
**Actions, in order.**
1. **Google API key rotation.** All Google API keys on the
developer's GCP and Google AI Studio accounts rotated at the
provider dashboards, regardless of which specific key matched
line 152, because the specific match was not known with
certainty. Rotation confirmed before any git operation.
2. **Billing/activity check.** Verified Google Cloud billing and
API activity on every project for the window since commit
`8ebe3964af7d` landed (2026-04-12 18:18). No unauthorized
activity observed.
3. **Why the validator regex and GitHub's detector are identical.**
The output validator's regex at `agent_bench/security/output_validator.py`
line 23 is `\bAIza[0-9A-Za-z_\-]{35}\b`, byte-for-byte identical
to GitHub's secret-scanning Google API Key detection pattern.
This means there is no static test fixture that satisfies the
validator's test assertion (the validator must block the input)
without also triggering GitHub's push protection. Any replacement
with a fixture that matches the validator's regex is immediately
re-flagged; any replacement with a fixture that does not match
the validator's regex breaks the test assertion. The cleanest
resolution is to remove the Google fixture from the static
parametrize list entirely and restore Google API key format
coverage via a runtime-generated fixture that constructs an
`AIza` prefix plus 35 characters at test time and never lands as
a literal in source code. Tracked as a parallel-tracks item.
The output validator's regex is NOT weakened; the test loses
one of seven parametrize cases but continues to verify OpenAI,
Anthropic, AWS, JWT, and env-var-assignment detection.
4. **Round-2 filter-repo.** Ran
`git filter-repo --replace-text <file> --force` with the pattern
file containing `regex:AIza[A-Za-z0-9_\-]{35}==>AIzaFIXTUREREDACTED`.
This replaced the Google API key format anywhere it appeared
in any historical blob across the entire repository. Every
commit from `8ebe3964af7d` forward was rewritten, which
cascaded through the full post-round-1 history including all
round-1-remapped SHAs and tonight's 5 commits. Total commits
processed: 186. filter-repo's internal commit-map wrote 152
changed entries and 35 unchanged entries (commits before
`8ebe3964af7d` that never touched the pattern).
5. **Working-tree fixture removal.** After the filter-repo rewrite,
`tests/test_output_validator.py` line 152 read
`"google says AIzaFIXTUREREDACTED"` (15 chars after `AIza`,
below the validator's 35-char regex threshold). Removed the
line entirely from the parametrize list and added a block
comment explaining the removal, the regex-collision reason,
the parallel-tracks item to restore via runtime-generated
fixture, and an explicit note that the validator's regex
remains unchanged. Committed as a separate new commit on top
of the rewritten history.
6. **Round-2 verification sweep.** Re-ran the same six-check
sweep: `git log`, `git rev-list --all --objects`, reflog,
fsck, stash, and a precise regex scan across all blobs for
the `\bAIza[0-9A-Za-z_\-]{35}\b` pattern. **Zero blobs** in
the post-round-2 object database contain a 35-char `AIza`
pattern. The scrub is complete across all history.
7. **Round-2 DECISIONS.md SHA remap.** The round-1 remap table
above uses SHAs `213da36`, `125dac0`, `5c1f49f`, `4454894`
as the "NEW (post-rewrite)" column. These are the
**post-round-2** SHAs; they were `e6d9675`, `c1d8163`,
`740c9d5`, `6d177ba` after round 1 and got rewritten again by
round 2. To avoid a three-column mapping table showing
intermediate round-1 SHAs, the table above reads as a direct
pre-rewrite → current-state mapping. The round-1-only
intermediate SHAs are preserved in this narrative as
"round-1 SHAs" for audit completeness but are not the
canonical SHAs anyone looking up a commit should use. The
canonical SHAs are the post-round-2 values.
**Additional round-2 SHA update:** parallel-tracks item #7
(Fix 2 revert commit SHA missing from the Fix 2 outcome entry)
was updated from `8c836f5` (post-round-1) to `27c2e17`
(post-round-2).
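The effect of the round-2 replacement rule (steps 4 to 6) can be reproduced on a single string. The sketch below uses Python's `re.sub` as a stand-in for filter-repo's `--replace-text` pass, which is an assumption made for illustration only; the key-shaped input is assembled at runtime so that no matching literal appears in the example itself.

```python
import re

# Pattern from the round-2 replace-text rule and the validator regex quoted above.
REPLACE_PATTERN = re.compile(r"AIza[A-Za-z0-9_\-]{35}")
VALIDATOR_PATTERN = re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b")
REPLACEMENT = "AIzaFIXTUREREDACTED"

# Build a key-shaped string at runtime so no matching literal appears in source.
fake_key = "AIza" + "x" * 35
blob_before = f'"google says {fake_key}"'

blob_after = REPLACE_PATTERN.sub(REPLACEMENT, blob_before)
suffix_len = len(REPLACEMENT) - len("AIza")

print(blob_after)                                   # the redacted fixture line
print(suffix_len)                                   # 15 characters after the prefix
print(bool(VALIDATOR_PATTERN.search(blob_before)))  # True: matched before the rewrite
print(bool(VALIDATOR_PATTERN.search(blob_after)))   # False: below the 35-char threshold
```

Because the replacement is shorter than the pattern's fixed 35-character suffix, the redacted text cannot re-trigger either the validator or an identical upstream detector.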
**Exposure scope, round 2.** The branch had been pushed to origin
exactly once before round-2 was discovered (the first push at the
end of round 1, which landed commit `3167b59` at origin). The
feature branch was the only affected ref: `main` was not updated,
and no PR had been merged. The round-2 cleanup requires a
force-push with `--force-with-lease` to overwrite the pushed
round-1 history with the round-2 history. Force-push is normally a
discipline concern, but here it is safe: the branch was published
less than one hour before round-2 was discovered, no other work
was based on the pushed round-1 history, and the force-push is
scoped to this specific branch (not `main` or any long-lived ref).
**Alert dismissal.** GitHub alert #1 was dismissed as
`false_positive` via `gh api` after the force-push, with the
resolution comment noting that the pre-round-2 commit SHA the
alert referenced (`8ebe3964af7d`) no longer exists in the
rewritten history and the test fixture has been removed from
`tests/test_output_validator.py` pending a runtime-generated
replacement.
**Round-2 procedural lesson.** The validator-regex ↔ detector-regex
identity is a structural finding worth noting for future security
test design. Any test fixture that verifies detection of a
specific secret format will, by construction, match the format
it is testing. If the format is one GitHub (or any upstream
detector) also scans for, the fixture will trigger an alert on
every push where it is introduced. The three durable mitigations
are: (a) generate fixtures at runtime so they never land in source,
(b) use an isolated regex that is a proper subset of the production
detector's regex so fixtures fall below the detector's match
threshold, or (c) exclude the file explicitly via a `paths-ignore`
entry in `.github/secret_scanning.yml`. This project is adopting
option (a) as the follow-up, because it preserves the production
detector regex without weakening and keeps the test's fidelity to
the actual attack surface.
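A minimal sketch of option (a), assuming a helper named `make_fake_google_key` (hypothetical; the project's eventual fixture may look different). The suffix is drawn only from letters and digits so the trailing `\b` in the validator regex always holds; a suffix ending in `-` would not sit on a word boundary.

```python
import random
import re
import string

# Validator regex as quoted in this entry.
GOOGLE_KEY_RE = re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b")


def make_fake_google_key(seed: int = 0) -> str:
    """Build an AIza-prefixed, key-shaped string at call time.

    Only letters and digits are used for the suffix so the trailing
    word boundary in the regex always holds.
    """
    rng = random.Random(seed)  # seeded for reproducible test runs
    alphabet = string.ascii_letters + string.digits
    suffix = "".join(rng.choice(alphabet) for _ in range(35))
    return "AIza" + suffix


fake_key = make_fake_google_key()
payload = f"google says {fake_key}"

print(len(fake_key))                        # 39
print(bool(GOOGLE_KEY_RE.search(payload)))  # True
```

In a pytest suite the helper could back a fixture or be called inside the test body, so the parametrize list itself stays free of key-shaped literals.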
## Audit-path bug: streams masked a request-crashing failure (2026-04-15)
During v1 deploy smoke testing, every non-stream `POST /ask` request
and every injection-blocked request on the HF Space returned
`{"detail": "Internal server error"}` (HTTP 500) instead of the
intended 200 / 403. Normal queries via `POST /ask/stream` appeared to
work correctly from the dashboard, so the bug was invisible until a
direct curl hit a non-stream endpoint. Discovered and fixed same-day.
**Root cause.** `agent_bench/security/audit_logger.py:60` called
`self.path.parent.mkdir(parents=True, exist_ok=True)` the first time
a request wrote an audit record. The default path `logs/audit.jsonl`
resolved to `/home/user/app/logs/audit.jsonl` at runtime. The
Dockerfile's `WORKDIR /home/user/app` creates that directory as
**root:root mode 0755** because Docker's WORKDIR directive does not
honor `--chown`. Subsequent `COPY --chown=user` lines only change
ownership of the copied files, not the directory itself. At runtime
under `USER user` (uid 1000), the process had `r-x` on WORKDIR: it
could read and execute, but not create new subdirectories. First
audit write → `PermissionError: [Errno 13] Permission denied: 'logs'`.
**Why `.cache/` worked and `logs/` didn't.** `.cache/` is created at
build time by the two `RUN python scripts/ingest.py ...` steps
(running as root) and then explicitly chowned by
`RUN chown -R user:user .cache/`. `logs/` was created lazily at
runtime by the audit logger, as a non-root user, in a directory owned
by root. Directory-creation permission wall.
**Why the streaming endpoint masked it.** `/ask/stream` calls
`_write_audit` at the *end* of the event generator
(`routes.py:438`), after all stage events and the final answer have
already been yielded over SSE. A failure there is already too late
to affect the client's view of the response: the client sees the
answer, the stream ends, and the audit entry is silently missing.
Non-stream `/ask` calls `_write_audit` synchronously before `return`
(line 263), so the failure propagates up through the middleware
exception handler and becomes a visible 500. Injection-blocked
requests on both endpoints also audit synchronously before returning
(lines 193, 302) and produce the same visible 500. Normal dashboard
use hits `/ask/stream` → symptom invisible; smoke testing
non-stream `/ask` → symptom obvious.
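The ordering difference can be shown without any web framework. In this sketch `ask`, `ask_stream`, and `write_audit` are simplified stand-ins written for illustration, not the handlers in `routes.py`. In a plain generator the late exception still reaches the consumer; over SSE the client has already received the content by then, which is why the symptom stayed invisible.

```python
from typing import Iterator


def write_audit(record: dict) -> None:
    """Stand-in for an audit write whose target directory is unwritable."""
    raise PermissionError(13, "Permission denied", "logs")


def ask(question: str) -> dict:
    """Non-streaming shape: the audit write runs before the return."""
    answer = {"answer": f"echo: {question}"}
    write_audit(answer)  # raises, so the caller never receives the answer
    return answer


def ask_stream(question: str) -> Iterator[str]:
    """Streaming shape: the audit write runs after the last chunk is yielded."""
    yield "echo: "
    yield question
    write_audit({"answer": question})  # raises only after the content is sent


# Non-streaming: the failure is visible instead of the answer.
try:
    ask("hi")
    sync_outcome = "ok"
except PermissionError:
    sync_outcome = "error"

# Streaming: the client has every chunk before the failure fires.
received = []
stream_outcome = "ok"
try:
    for chunk in ask_stream("hi"):
        received.append(chunk)
except PermissionError:
    stream_outcome = "error after content"

print(sync_outcome)       # error
print("".join(received))  # echo: hi
print(stream_outcome)     # error after content
```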
**Traceback confirmed from the HF Space runtime log:**
```
File "/home/user/app/agent_bench/serving/routes.py", line 302, in ask_stream
_write_audit(
File "/home/user/app/agent_bench/serving/routes.py", line 558, in _write_audit
audit_logger.log(record)
File "/home/user/app/agent_bench/security/audit_logger.py", line 60, in log
self.path.parent.mkdir(parents=True, exist_ok=True)
PermissionError: [Errno 13] Permission denied: 'logs'
```
**Two-fix patch.** Both applied in the same deploy cycle:
1. **Dockerfile (proximate).** Commit `ca34ccb`:
`RUN mkdir -p logs && chown -R user:user .cache/ logs/` at build
time, before `USER user`. Ensures the runtime user owns the
audit-log directory from container start.
2. **`AuditLogger.log()` (architectural, defense-in-depth).** Commit
`25e0f1b`: wrap the entire write body in `try/except Exception`,
log the failure via structlog as `audit_write_failed` with the
exception type and path, return normally.
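The shape of fix 2 can be sketched as follows. This is an assumption-laden illustration rather than the code in commit `25e0f1b`: it uses the standard-library `logging` module in place of structlog to stay self-contained, and the `failures` counter is a hypothetical addition. The unwritable target is simulated by placing a regular file where a parent directory is expected, which also fails when the example is run as root.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("audit")


class AuditLogger:
    """Append-only JSONL audit log whose write failures never propagate."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self.failures = 0  # hypothetical counter, handy for tests and health checks

    def log(self, record: dict) -> None:
        try:
            self.path.parent.mkdir(parents=True, exist_ok=True)
            with self.path.open("a", encoding="utf-8") as handle:
                handle.write(json.dumps(record) + "\n")
        except Exception as exc:
            # Degrade loudly: record the failure, then return normally.
            self.failures += 1
            logger.error(
                "audit_write_failed error_type=%s path=%s",
                type(exc).__name__,
                self.path,
            )


with tempfile.TemporaryDirectory() as tmp:
    # Writable target: the record is appended.
    good = AuditLogger(Path(tmp) / "logs" / "audit.jsonl")
    good.log({"event": "ask", "status": 200})
    written = good.path.read_text(encoding="utf-8")

    # Unwritable target: the parent "directory" is a regular file, so mkdir fails.
    blocker = Path(tmp) / "blocker"
    blocker.write_text("not a directory")
    bad = AuditLogger(blocker / "logs" / "audit.jsonl")
    bad.log({"event": "ask", "status": 200})  # must not raise

print(good.failures, bad.failures)  # 0 1
```

The second half of the example doubles as a regression check: the call on the broken path returns normally and leaves a logged failure behind instead of an exception.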
**Why both, not either alone.**
- **Only fix 1:** unblocks the Space but leaves the architectural
bug: an audit logger that can crash the app is misdesigned
regardless of filesystem permissions. The SECURITY.md OWASP LLM10
writeup would be structurally weak.
- **Only fix 2:** stops the crashes, but audit writes would still
silently fail on HF Spaces because `logs/` remains unwritable. The
OWASP LLM10 claim would be *live-untrue* on the demo surface.
- **Both:** demo is honest (audit log actually written on HF Space),
architecture is sound (audit failures degrade gracefully), and the
OWASP LLM10 writeup has live evidence backing both claims.
**Verification on the live Space after the patch.** `GET /health`
flipped from `"degraded"` to `"healthy"` with `provider_available:
true`; `POST /ask` normal returned 200 with grounded answers citing
real sources from both corpora (was 500); `POST /ask` with the
`ignore_previous` injection pattern returned 403
`"Request blocked: potential prompt injection detected"` (was 500);
the runtime log shows `injection_detected pattern=ignore_previous
tier=heuristic` followed by a 403 status and no `audit_write_failed`
entries, confirming the audit write succeeded end-to-end.
**Methodology lesson: streaming endpoints can hide backend failures.**
Any request-path side effect (audit logging, metrics, cache write,
persistence) that runs at the end of a streaming generator will fail
silently from the client's perspective, because content is already
committed to the wire before the failure fires. Testing such side
effects against their failure modes *through* a streaming happy path
is not sufficient: the happy path masks the failure. Every
request-path side effect needs at least one test that verifies it
fails loudly (or, where the design says graceful, degrades loudly)
when its underlying dependency is broken, **exercised synchronously,
not via the streaming code path.**
For this project, `AuditLogger.log()` should gain a unit test that
asserts it returns normally (not raises) when its target directory is
unwritable, i.e., a regression test for the `try/except` fix above.
And the dependency injection of the audit logger in the route
handlers should get an integration test that exercises both the
streaming and non-streaming paths against the same failure injection,
to assert the streaming path does not hide what the non-streaming
path surfaces. Tracked as a v1.1 test-coverage follow-up.
**Why this wasn't caught locally.** Existing
`tests/test_audit_logger.py` tests use pytest's `tmp_path` fixture,
which is always writable by the test process. The failure mode only
manifests when the runtime user cannot create the log directory, a
deployment-environment condition not reproducible in a unit test
against a local filesystem owned by the developer. An integration
test that runs the Docker image end-to-end and hits `/ask` against
the actual container filesystem would have caught it pre-deploy.
Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
build infrastructure) but is the right long-term mitigation for this
class of bug.
## LLM-judge layer supersession: discrete-anchored 2-judge jury replaces continuous-score single-call
The continuous-score single-call judges in `agent_bench/evaluation/metrics.py`
(`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted
and replaced by the per-dimension Judge layer at
`agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle.
**Design doc:** `docs/plans/2026-05-04-judge-layer-v1-design.md`.
**Why this is a supersession, not a refactor.** The new layer differs from
the old on six axes: discrete-anchored scale (vs continuous 0–1),
reasoning-before-score JSON ordering (vs score-first), per-dimension
judges (vs combined faithfulness/correctness), full provenance per call
(judge_id + rubric_version + system_output_hash + prompt_seed; old had
none), composable variance wrappers (rubric_permute, jury; old was
single-call), and an intentional abstain-vs-raise discipline (vs silent
`None` from a bare `except Exception`).
**Evidence backing the supersession claim:** the calibration κ table
quantifies the new layer's agreement with hand-labels across 6 ablation
rows (baseline + 3 variance ablations + permute + 2-judge jury). The
files defending this entry's claim, by file path:
- `measurements/2026-05-04-judge-calibration-labels.jsonl`: 30 items × 3
dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the
literature ceiling). Lands in Phase 10.
- `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`,
`_baseline_no_anchors.json`, `_baseline_no_abstain.json`,
`_permute.json`, `_jury_kappa_weighted.json`: per-row predictions.
Land in Phase 11.
- `docs/_generated/kappa_table.md`: generated κ ablation table
copy-pasted into the writeup. Lands in Phase 11.
- `docs/judge-design.md`: interpretive writeup with the closing
"when NOT to use LLM-judge" position. Lands in Phase 12.
**Config-knob preservation.** `evaluation.judge_provider` is unchanged
across all 5 YAML configs; new `evaluation.judge_dimensions` field
defaults to the three v1 dimensions. Zero user-facing config migration.
**Out of scope (v1.1+).** Mistral self-hosted as the third jury member,
Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
prompt optimization, citation_faithfulness in the default
judge_dimensions, AC2 sympy-derived parity tests.
## Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran (2026-05-05)
The Opus stress-test pass over the 30 calibration items × 3 dimensions
disagreed with the single-rater human gold on **22 of 30 groundedness
items** (8/30 agreement). Relevance and completeness agreed at 28/30 and
25/30 respectively. The groundedness disagreement is consistent in
direction (every disagreed-on item is `human=1, opus=0`) and has a
single root cause.
**Root cause: reference-scope drift between rubric author and labeler.**
`agent_bench/evaluation/rubrics/groundedness.md` defines the reference
scope as the gold snippets attached to each item:
> The judge sees only the gold snippets, not the retrieved chunks. A
> claim that happens to be true in the world but is not entailed by the
> snippets fails groundedness.
The single-rater notes on the disagreed-on items describe checking
against the broader documentation, not against `source_snippets`:
"supported by the corpus", "supported by the docs", "supported by the
provided dependency snippet". For items like `k8s_006` the gold snippet
is one sentence ("A ConfigMap is an API object used to store
non-confidential data in key-value pairs"), while the agent's answer
correctly synthesizes seven or eight additional claims from the full
`k8s_configmap.md`. Those claims are true in the world and well-supported
by the full doc, but **not entailed by the one snippet**. Opus applied
the strict-snippet rubric; the human rater applied a corpus-supported
rubric.
**Why this blocks `make calibrate` against the current gold.** The κ
ablation compares Haiku and GPT-4o-mini judges against the human gold.
A judge that correctly applies the strict-snippet rubric will disagree
with miscalibrated gold; a judge that's too lenient will agree. The
ablation rewards leniency and punishes rigor, the opposite of the
intended measurement. This is the same failure mode codified earlier in
this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune
compensation when the measurement is wrong, not the intended effect.
**Why the rubric stays as written, not relaxed to "corpus-supported".**
Strict-snippet groundedness measures *RAG behavior*: did the agent
synthesize from what it retrieved? Corpus-supported groundedness
measures *LLM general knowledge passing through a RAG harness*: did the
agent happen to be correct? The first is what this benchmark is for;
the second is what `agent_bench/evaluation/metrics.py` measured before
supersession. Relaxing the rubric to "corpus-supported" would silently
re-introduce the failure mode the supersession entry above just removed.
**Decision (three-step correction lands before `make calibrate` runs):**
1. **Rubric clarification commit on `agent_bench/evaluation/rubrics/groundedness.md`.**
Add an explicit reference-scope line and one anchored example
contrasting "supported by the snippet" vs "true in the world but
not in the snippet". Audit-trail requirement: the v1.1 writeup will
cite "rubric clarified between v1.0 and v1.1", and the git history
needs to back that claim.
2. **Re-label the 22 disagreed-on groundedness items** in
`measurements/2026-05-04-judge-calibration-labels.jsonl` against the
clarified rubric, snippet-only. **Do not mechanically copy Opus's
labels.** The labels remain the human single-rater's; what changes is
the rubric being applied. Mechanical copy would turn the κ table
into "judge vs Opus", which is not what the writeup claims it
measures.
3. **Recompute `make calibrate` against the corrected gold** and emit
`docs/_generated/kappa_table.md` from the v1.1 labels.
**Evidence files for the v1.1 writeup section:**
- `measurements/2026-05-05-judge-rubric-opus-stress.jsonl`: 90 Opus
labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min
org rate limit, ~$0.20, ~14 min wall, zero infra-abstains).
- `measurements/2026-05-04-judge-calibration-labels.jsonl`: original
v1.0 single-rater gold; will be diffed against v1.1 corrected gold to
quantify the re-label delta.
- `agent_bench/evaluation/rubrics/groundedness.md`: pre/post diff is
the rubric clarification.
**Pre-labeling observations also worth recording for the writeup
methodology section:**
- `q021` (fastapi · calculation) answered the CORS preflight question
correctly (600 / 60 = 10 minutes) with `sources: []` and
`ranked_sources: []`: the agent did the arithmetic without retrieval
and emitted an answer consistent with the snippet without having
retrieved it. Methodologically interesting for the
citation-faithfulness story (Block 2.7) if it ships: an answer can be
correct without being grounded-by-citation.
- `q025` (fastapi · multi_hop) answer was truncated mid-token by the
orchestrator's max_tokens limit. The labels reflect what the system
produced, not a mentally-patched complete version. The completeness
rubric does not currently anchor "truncated response" as a level;
v1.1 rubric work should add an anchor.
- Several K8s items embed external knowledge that's correct but not in
the snippet phrasing (`k8s_017` mentions exit-code-0 for init-container
success; `k8s_009` describes Roles vs ClusterRoles by their semantics).
The clarified groundedness rubric should pick **strict** on this case
(claim must be supportable by the retrieved spans, not just consistent
with them) and the anchored example should show that ruling.
**Methodology framing for the writeup.** The Opus stress-test was added
specifically to catch hand-labeled-gold fragility before the κ table is
published. It caught it. The writeup's calibration section should
disclose the rubric clarification, quantify the re-label delta on
groundedness, and report κ against the v1.1 corrected gold; that is a
more credible story than a first-try clean ΞΊ table would have been.
**Outcome: 2026-05-05 calibrate run on v1.1 gold.** All 6 ablation rows
ran cleanly after three coupled production-code fixes that landed on the
same branch as the rubric clarification: (1) markdown fence stripping in
`agent_bench/evaluation/judges/base.py::_strip_markdown_fence` because
Haiku 4.5 wraps JSON output in ` ```json ... ``` `, (2) `max_tokens`
512 → 1024 because v1.1 anchored examples elicit longer model reasoning,
(3) calibration runner v1.0 omitted `item_id` from prediction records;
fixed in v1.1 with backfill of the 6 already-written row files via
`hash → item_id` map (no re-spend). Probe-one-cell-before-sweep saved a
fourth $0.50 wasted run after the fence-strip change; the methodology
note in `feedback_judge_probe_before_sweep.md` was earned by this
session's two failed full-row attempts that paid ~$1.15 for unparseable
output before the diagnosis converged.
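Fix (1) can be illustrated with a small fence-stripping helper. This is a sketch of the general technique, not the body of `_strip_markdown_fence` in `base.py`, which is not reproduced in this document; the function name, regex, and sample payload below are assumptions. The fence marker is assembled at runtime only to keep literal fences out of the example.

```python
import json
import re

FENCE = "`" * 3  # built at runtime to keep literal fences out of this example

_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fence(text: str) -> str:
    """Return the fenced body if `text` is one fenced block, else `text` stripped."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


fenced = FENCE + 'json\n{"reasoning": "covers the points", "score": 2}\n' + FENCE
bare = '{"reasoning": "covers the points", "score": 2}'

parsed_fenced = json.loads(strip_markdown_fence(fenced))
parsed_bare = json.loads(strip_markdown_fence(bare))
print(parsed_fenced == parsed_bare)  # True
print(parsed_fenced["score"])        # 2
```

Leaving unfenced input untouched keeps the helper safe to apply to every judge response, whichever provider produced it.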
The κ table at `docs/_generated/kappa_table.md` (regenerated on
2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for
completeness; see report.py `_DIM_METRIC`) shows three findings
that the writeup interprets rather than reports verbatim:
**v1.1 finding 1: relevance is not "judges fail" territory.**
Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the
29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00.
AC1 is the load-bearing statistic on relevance and groundedness; both
metrics agree on completeness where the gold (23×2 / 5×1) is more balanced.
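The prevalence degeneracy is easy to reproduce. The sketch below uses the gold distribution stated above (29 items at score 2, one at score 1) against a hypothetical judge that scores every item 2; that judge, the three-category scale `(0, 1, 2)`, and the function names are assumptions for illustration, not the contents of any results file. The AC1 function follows the standard two-rater form of Gwet's statistic.

```python
from collections import Counter


def cohen_kappa(gold: list[int], pred: list[int]) -> float:
    """Cohen's kappa: chance agreement from each rater's own marginals."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    expected = sum(
        (gold_freq[k] / n) * (pred_freq[k] / n)
        for k in set(gold) | set(pred)
    )
    if expected == 1.0:
        return 0.0  # both raters constant: kappa is undefined, reported as 0 here
    return (observed - expected) / (1.0 - expected)


def gwet_ac1(gold: list[int], pred: list[int], categories: list[int]) -> float:
    """Gwet's AC1: chance agreement from the averaged marginals."""
    n = len(gold)
    q = len(categories)
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    expected = sum(
        pi * (1.0 - pi)
        for pi in ((gold_freq[k] + pred_freq[k]) / (2 * n) for k in categories)
    ) / (q - 1)
    return (observed - expected) / (1.0 - expected)


gold = [2] * 29 + [1]  # distribution stated in the text
pred = [2] * 30        # hypothetical judge that always scores 2

kappa = cohen_kappa(gold, pred)
ac1 = gwet_ac1(gold, pred, categories=[0, 1, 2])
raw = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(round(raw, 3))    # 0.967
print(round(kappa, 3))  # 0.0
print(round(ac1, 3))    # 0.966
```

With one rater constant, Cohen's expected agreement equals the observed agreement, so κ collapses to zero even at 29/30 raw agreement, while AC1 stays close to the raw figure.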
**v1.1 finding 2: `no_cot completeness` agreement is real, not
selective abstain.** AC1 = κ = 1.000 at n=24. The 2 absent cells
(`q021`, `k8s_012`) are infrastructure abstains (provider rate-limit
retry exhaustion), both gold=`2`, neither in baseline's disagreement
set. On the 24 scored cells, all 4 baseline-with-CoT disagreements
(3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to
agreement when CoT is removed. The interview-relevant claim is the
*opposite* of the conventional CoT-helps story: CoT-before-score on
3-point completeness lets the judge over-emphasize partial coverage
and rationalize `1` when the human gold sides with the holistic
"covers the points" reading.
**v1.1 finding 3: `jury_kappa_weighted` underperformed baseline on
completeness, with a precise mechanism.** Per-member analysis from
`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`:
Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%;
gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw
26.9%, systematically harsh on the 3-point scale, almost never
scoring `2`. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9%:
raw agreement matches gpt-4o-mini alone exactly because the jury
verdict reduces to gpt-4o-mini's verdict on every disputed cell.
The mechanism is *missing-weight + round-down* compounding, not
weighted voting in the usual sense. `scripts/run_calibration.py
::_load_weights_from_baseline` is a documented v1 stub that returns
weight = 1.0 for every judge_id present in baseline. baseline.json
contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini
gets 1.0 from `jury.py`'s missing-key fallback (with a logged
`jury_missing_weight_fallback_to_one` warning per call). Equal
weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean
of 1.5; the `_discretize_mean` rule is `frac > 0.5 → ceil else floor`,
and `0.5 > 0.5` is false, so 1.5 floors to 1. gpt-4o-mini's verdict
wins every disputed cell. The v1 design doc's risks subsection listed
"jury κ worse than the better individual judge: (a) kappa-weighting
wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired
*both* branches simultaneously: branch (a) because the weighting is a
stub returning equal weights, and branch (b) because round-down at
exact 0.5 ties hands the verdict to the lower-scoring member.
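The compounding can be traced in a few lines. The functions below are illustrative reimplementations of the behavior described above, not the code in `jury.py` or `run_calibration.py`, and the judge ids `haiku` and `gpt-4o-mini` are placeholder keys.

```python
import math


def discretize_mean(mean: float) -> int:
    """Ties-to-lower rounding: ceil only when the fractional part exceeds 0.5."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


def jury_verdict(scores: dict[str, int], weights: dict[str, float]) -> int:
    """Weighted mean of member scores; a missing weight falls back to 1.0."""
    total = sum(weights.get(judge, 1.0) for judge in scores)
    mean = sum(score * weights.get(judge, 1.0) for judge, score in scores.items()) / total
    return discretize_mean(mean)


# Only the Haiku weight is present, mirroring the stubbed weights source.
weights = {"haiku": 1.0}

disputed = jury_verdict({"haiku": 2, "gpt-4o-mini": 1}, weights)
agreed = jury_verdict({"haiku": 2, "gpt-4o-mini": 2}, weights)
print(disputed)  # 1: the 1.5 mean floors to the lower member's score
print(agreed)    # 2
```

With two members at equal weight, every one-point dispute lands exactly on a .5 mean, so the tie-break rule alone decides the verdict.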
The deeper structural point is that weighting alone cannot rescue a
systematically miscalibrated member. Even held-out validation that
correctly assigned gpt-4o-mini's true low weight on completeness
would still let it dominate disputed ties unless its weight were
driven near zero, and at that point exclusion is more honest than
near-zero inclusion. The conservative-on-binary "ties to lower" rule
also doesn't transfer cleanly to ordinal scales: on completeness,
"conservative" means scoring *toward incomplete*, which is precisely
the direction of gpt-4o-mini's bias.
**v1.2 fix list (four items, expanding the earlier two-item list):**
1. **Held-out jury weights.** Replace the
`_load_weights_from_baseline` stub with a real κ-derived
computation, evaluated on a *held-out validation set*, not the
same calibration row whose κ is being measured against the gold.
Closes the circular-weighting hole.
2. **Symmetric member coverage in the weights source.** Missing-member
fallback to weight = 1.0 amplifies an unweighted member rather than
suppressing it. Either every jury member must have a weight in the
source file or the run must abort. The `jury_missing_weight_
fallback_to_one` warning fired loudly on every call this run; in
v1.2 it should be a hard error.
3. **Per-dimension member exclusion when individual κ falls below a
threshold.** gpt-4o-mini at κ = 0.020 on completeness should not be
in the completeness jury at all. Weights below a floor (suggested
κ < 0.2) should be treated as exclusion, not as small-weight
inclusion. Held-out validation fixes circular weighting; it does
not fix systematic member bias.
4. **Per-dimension tie-break rule.** v1's `_discretize_mean` rule
(ties to lower) was selected for conservative behavior on binary
scales, where "conservative" means scoring 0 on uncertainty. On
3-point completeness, "conservative" means scoring toward
*incomplete*, which interacts badly with member miscalibration.
v1.2 should select the tie-break rule per-dimension based on the
rubric's conservative direction, not globally.
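Item 4 can be sketched as a per-dimension lookup that is consulted only on exact 0.5 ties. This is an illustration under assumptions, not the project's `_discretize_mean`: the `TIE_BREAK` table, its values, and the `discretize` name are hypothetical, and which direction each rubric should actually take is the open v1.2 decision.

```python
import math

# Hypothetical per-dimension tie-break table; the values are illustrative only.
TIE_BREAK = {"groundedness": "floor", "relevance": "floor", "completeness": "ceil"}


def discretize(mean: float, dimension: str) -> int:
    """Round a weighted mean to an integer score.

    Off-tie behaviour matches the rule described above (frac > 0.5 -> ceil,
    frac < 0.5 -> floor); only exact 0.5 ties consult the per-dimension table.
    """
    low = math.floor(mean)
    frac = mean - low
    if frac > 0.5:
        return low + 1
    if frac < 0.5:
        return low
    return low + 1 if TIE_BREAK[dimension] == "ceil" else low


tie_on_completeness = discretize(1.5, "completeness")
tie_on_groundedness = discretize(1.5, "groundedness")
```

With this table, the equal-weight disputed cell (mean 1.5) would resolve to 2 on completeness while still resolving to the lower score on the other dimensions.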
**Evidence files:** `docs/_generated/kappa_table.md` (regenerated with
AC1 for groundedness/relevance, κ for completeness);
`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`
(per-member sidecar where the gpt-4o-mini completeness bias is
visible per item); `results/calibration_v1_judge_baseline.json`
(weights source; note the absence of any gpt-4o-mini-2024-07-18
entries, which is why the missing-weight fallback fires).
## v1.1 jury rescue: sharpened diagnostic + pre-committed A+B success criteria
**Date:** 2026-05-06. **Status:** in-flight; this entry is the pre-experiment
contract that pins down what counts as success before the re-aggregation
runs, so the outcome can't be negotiated post-hoc.
**Sharpened diagnostic: extraction-vs-reasoning split, not just "model is
biased".** Re-reading the per-member sidecar (item-level, not aggregate)
on the gpt-4o-mini completeness disputes shows a more specific failure
mode than "harsh on 3-point". On the three representative gold=2 / Haiku=2
/ gpt=1 cases (q006, k8s_002, k8s_018), gpt-4o-mini's `evidence_quotes`
field correctly extracts the paraphrased coverage from the agent answer,
and then its `reasoning` field denies that those very quotes constitute
coverage. k8s_002 is the cleanest instance: the model quotes the strings
"declarative updates" and "sticky identity" into evidence, then writes
"the answer does not explicitly mention 'declarative updates' and 'sticky
identity'". The score follows the reasoning, not the evidence. The
mechanism is that the model's *post-extraction reasoning step* applies a
literal-string-match standard to the answer text while the rubric
requires "paraphrase allowed"; i.e., the structured-output discipline
forced an extraction step that the reasoning step then contradicted on
autopilot. This is a known failure mode in chain-of-thought judges and
shows up more in smaller models because the reasoning step has less
capacity to integrate the rubric's instruction with the literal-text
comparison the model is running by default. The artifact for the writeup
is `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` (three
side-by-side reasoning + evidence_quotes excerpts).
**Pragmatic v1.1 weights-source decision.** The v1.2 fix-list above
specifies a held-out validation set for jury weights, which is methodologically
clean but requires either splitting N=30 (loses statistical power on
both halves) or labeling more items (eats interview prep time). v1.1
chooses pragmatic: weights computed from the same calibration set used
for κ reporting, with the circularity flagged in the writeup. Reason:
(a) the alternative is splitting N=30, (b) the per-member κ values used
as weights are internally consistent, (c) v1.2 will use a held-out
20-item set. The writeup will contain a sentence acknowledging the
circularity rather than hiding it.
**v1.1 elevated fix-list (subset of the v1.2 list above).** Item 2
(symmetric coverage / hard-error) is elevated unconditionally. Item 1
(real κ-derived weights) is elevated in pragmatic form (same set with
circularity caveat). Items 3 (per-dimension exclusion) and 4
(per-dimension tie-break) remain v1.2 unless B's outcome forces them up.
**Pre-committed B success criteria.** Plan B is "re-aggregate the existing
164 member-rows in `calibration_v1_judge_jury_kappa_weighted_members.jsonl`
with corrected κ-derived weights, no new API spend." The outcome maps
deterministically to one of three predefined responses, picked *before*
B runs:
- **Outcome 1: jury κ on completeness exceeds Haiku-baseline κ by ≥
0.05** (i.e., new jury κ ≥ 0.466, vs Haiku-alone 0.416). Writeup story:
"v1's weights-source bug masked correct aggregation; once both bugs
(asymmetric coverage + missing-weight fallback) are fixed, the jury
improves on baseline. Per-dimension exclusion remains a v1.2 design
pattern but is not needed at v1.1." This is the strong story.
- **Outcome 2: jury κ within ±0.05 of Haiku-baseline** (i.e., 0.366 ≤
jury κ < 0.466, so the upper boundary belongs to Outcome 1). Writeup
story: "weights-source fix recovers parity but the jury isn't doing
meaningful work on completeness; gpt-4o-mini's near-zero weight makes
it effectively excluded by aggregation.
This is *soft exclusion via weighting*; v1.2 will make exclusion
explicit." Defensible but less clean.
- **Outcome 3: jury κ falls below Haiku-baseline κ by >0.05** (i.e.,
jury κ < 0.366). Writeup story: "weights-source fix is necessary but
not sufficient; even at near-zero weight gpt-4o-mini's verdict tips
disputed (1, 2) ties due to the round-down rule. v1.1 escalates to
per-dimension exclusion." Item 3 of the v1.2 fix-list moves into v1.1.
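The three-way mapping above is mechanical enough to write down before the run. A minimal sketch, assuming the 0.416 baseline and 0.05 margin stated in the criteria; the function and constant names are hypothetical, and assigning the exact upper boundary to Outcome 1 follows the "≥ 0.05" wording of Outcome 1.

```python
HAIKU_BASELINE_KAPPA = 0.416  # Haiku-alone completeness kappa, from this entry
MARGIN = 0.05                 # pre-committed band half-width


def classify_b_outcome(jury_kappa: float) -> int:
    """Map a re-aggregated jury kappa to pre-committed outcome 1, 2, or 3."""
    delta = jury_kappa - HAIKU_BASELINE_KAPPA
    if delta >= MARGIN:
        return 1  # jury improves on baseline
    if delta < -MARGIN:
        return 3  # escalate to per-dimension exclusion
    return 2      # parity: soft exclusion via weighting


outcome_at_parity = classify_b_outcome(0.416)
```

Values that sit exactly on a band edge are sensitive to floating-point rounding, so a real implementation would probably compare rounded three-decimal values, matching how κ is reported.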
**Why the predefined-criteria framing matters.** "I ran B, looked at the
number, decided it was good enough" is the same data with a weaker frame
than "I predefined the success criteria before running the experiment, B
landed at outcome X, which mapped to predefined response Y". The latter
demonstrates evaluation maturity in the writeup; the former invites
post-hoc reading of the outcome.
**B outcome (2026-05-06).** Plan B re-aggregated the existing 164 sidecar
rows with κ-derived weights (Haiku=0.416, gpt-4o-mini=0.020 on
completeness; clipped at 0 from raw κ values). Result: **jury κ on
completeness = 0.416**, exactly matching Haiku-baseline. Δ = 0.000;
maps to **Outcome 2 (soft exclusion via weighting)**. Per the
pre-committed response, v1.1 stops here and writes up; per-dimension
member exclusion (item C / v1.2 fix #3) is not escalated to v1.1.
Mechanism, validated empirically: a disputed cell (Haiku=2, gpt=1)
with corrected weights aggregates as `(2 × 0.416 + 1 × 0.020) / 0.436 =
1.954`. The fractional part (0.954) exceeds 0.5, so `_discretize_mean`
ceils to 2, giving the correct verdict. v1's two compounding bugs
(asymmetric source returning weight=1.0 for Haiku and the missing-key
fallback returning 1.0 for gpt-4o-mini) jointly forced equal weights,
and equal weights under the same rule produced
`(2 × 1 + 1 × 1) / 2 = 1.5`, whose fractional part is exactly 0.5
(not > 0.5) and therefore floors to 1, so gpt's verdict won every
disputed cell. The bug fixes recover the right verdict purely
mechanically; no judge model behavior changes.
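The arithmetic in the paragraph above can be replayed in a few lines. This is a sketch of the described rule, not the code in `jury.py`: `discretize_mean` and `jury_score` are stand-in names, and the missing-weight check mirrors the v1.1 hard-error contract as described in this entry.

```python
import math


def discretize_mean(mean: float) -> int:
    """Rule as described above: frac > 0.5 -> ceil, otherwise floor."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


def jury_score(scores: dict[str, int], weights: dict[str, float]) -> int:
    """Weighted mean of member scores, then discretized."""
    missing = scores.keys() - weights.keys()
    if missing:
        # v1 silently fell back to weight 1.0 here; v1.1 refuses to run.
        raise ValueError(f"no weight for jury member(s): {sorted(missing)}")
    total = sum(weights[m] for m in scores)
    if total == 0:
        raise ValueError("all member weights are zero")
    mean = sum(scores[m] * weights[m] for m in scores) / total
    return discretize_mean(mean)


disputed = {"haiku": 2, "gpt-4o-mini": 1}
v1_equal = jury_score(disputed, {"haiku": 1.0, "gpt-4o-mini": 1.0})
v1_1_kappa = jury_score(disputed, {"haiku": 0.416, "gpt-4o-mini": 0.020})
print(v1_equal, v1_1_kappa)  # 1 2
```

The same scores produce opposite verdicts purely because of the weights: 1.5 floors to 1 under equal weights, while 1.954 ceils to 2 under the κ-derived ones.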
The empirical reading: the weighting is *not doing meaningful work*.
gpt-4o-mini's near-zero weight effectively excludes it on completeness,
and the jury's κ matches Haiku-alone exactly because Haiku's verdict
wins every disputed cell. This is "soft exclusion via weighting"; v1.2's
explicit per-dimension exclusion (item 3 of the v1.2 fix-list) makes the
exclusion visible in the jury config rather than emergent from κ-derived
weight collapse.
**v1.1 code changes (this commit):**
- `agent_bench/evaluation/variance/jury.py`: silent missing-weight
fallback to 1.0 → hard `ValueError`. Two existing tests that asserted
the old contract (`test_kappa_weighted_reasoning_reports_applied_weights_not_dict`,
`test_kappa_weighted_logs_warning_on_missing_weight`) updated to
assert the new contract.
- `scripts/run_calibration.py::_load_weights_from_baseline` →
`_compute_kappa_weights`: replaces the v1 stub with real per-judge
Cohen's κ on the dimension; hard-errors when any expected member is
missing from the source. Clips κ < 0 to weight = 0 (soft exclusion).
- `configs/calibration/rows/jury_kappa_weighted.yaml`: `weights_source`
re-pointed from `calibration_v1_judge_baseline.json` (Haiku-only,
asymmetric coverage) to
`calibration_v1_judge_jury_kappa_weighted_members.jsonl` (both judges,
same calibration set with documented circularity).
- `tests/scripts/test_run_calibration_dispatch.py`: two new tests cover
`_compute_kappa_weights`: (a) computes real κ (high-agreement judge →
weight=1.0, chance-agreement judge → 0); (b) hard-errors on
asymmetric source coverage.
- `results/calibration_v1_judge_jury_kappa_weighted_v1_1.json`: new
predictions row produced by re-aggregating the existing sidecar
offline (no API spend; via `scripts/_dev/reaggregate_jury_v1_1.py`).
`docs/_generated/kappa_table.md` regenerated with this row alongside
the broken v1 row, giving the writeup a clean before/after diff
(completeness: 0.014 → 0.416, n=26).
- `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`: the
three side-by-side reasoning + evidence_quotes excerpts (q006 /
k8s_002 / k8s_018) demonstrating the extraction-vs-reasoning split
diagnostic finding.
The v1.2 fix-list above is unchanged in scope; v1.1 elevates items 1
(pragmatic form) and 2 (full form). Items 3 and 4 remain v1.2.
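For readers without the repository open, the contract described for `_compute_kappa_weights` (real per-judge Cohen's κ, negative values clipped to 0, hard error on asymmetric coverage) might look roughly like the sketch below. The signatures and the in-memory input shape are assumptions; the actual function reads the sidecar file and may be structured differently.

```python
from collections import Counter


def cohen_kappa(gold: list[int], pred: list[int]) -> float:
    """Unweighted Cohen's kappa: (P_o - P_e) / (1 - P_e)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Chance agreement from the two marginal distributions.
    p_e = sum(gold_counts[k] * pred_counts[k] for k in gold_counts) / (n * n)
    if p_e == 1.0:
        return 0.0  # both raters constant on one class; kappa is undefined
    return (p_o - p_e) / (1.0 - p_e)


def compute_kappa_weights(gold, member_preds, expected_members):
    """Per-member weight = max(kappa, 0); abort on asymmetric coverage."""
    missing = set(expected_members) - set(member_preds)
    if missing:
        raise ValueError(f"weights source lacks member(s): {sorted(missing)}")
    return {
        m: max(cohen_kappa(gold, member_preds[m]), 0.0) for m in expected_members
    }


gold = [0, 0, 1, 1]
preds = {"agrees": [0, 0, 1, 1], "chance": [0, 1, 0, 1], "inverted": [1, 1, 0, 0]}
weights = compute_kappa_weights(gold, preds, ["agrees", "chance", "inverted"])
```

On this toy data the perfectly agreeing judge gets weight 1.0, the chance-level judge gets 0, and the inverted judge (κ = -1) is clipped to 0 rather than given a negative weight.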
## Plan 3A: recency-positioned paraphrase instruction (pre-committed criteria)
**Date:** 2026-05-06. **Status:** in-flight; this entry pins down the
hypothesis and success criteria before the experiment runs.
**Hypothesis sharpened by the 1A direction-of-bias finding.**
GPT-4o-mini's completeness disagreements are 17/19 gold=2/pred=1 with zero
up-mistakes across 26 items spanning two corpora: direction-aware noise,
not balanced random labeling. The model is consistently applying *some*
rule stricter than the rubric requires. The hypothesis under test: that
stricter rule is "literal-string match required, paraphrase doesn't
count," and the bias is fixable by recency-positioning the rubric's
"paraphrase allowed" instruction adjacent to the commit-to-score
decision instead of leaving it 500+ tokens upstream in the rubric body.
**The intervention is positional, not lexical.** The current
`CompletenessJudge` prompt (`agent_bench/evaluation/judges/completeness.py`)
sends the rubric body, then the gold reference, then the system answer,
then a one-line "Score this answer..." instruction immediately followed
by the JSON schema clause. The rubric body's "paraphrase allowed" clause
appears in the introductory paragraphs, hundreds of tokens before the
score decision. The intervention adds one sentence between the system
answer and the score instruction:
> *"Note: a paraphrase that captures the same meaning as a gold-answer
> point counts as covered. Score on content equivalence, not surface
> form."*
This is the recency-positioning hypothesis: the model loses the
paraphrase conditioning across the rubric anchors and the reasoning
step. Restating the instruction adjacent to the score decision tests
whether the bias is positionally correctable.
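Because the intervention is about position rather than wording, it helps to see the section order spelled out. A minimal sketch, assuming a simple section-join prompt builder: `build_completeness_prompt`, the section labels, and the `SCORE_INSTRUCTION` placeholder are hypothetical, the JSON schema clause is omitted, and only the clause text is taken from this entry.

```python
# Clause text quoted from this entry; everything else here is a placeholder.
PARAPHRASE_RECENCY_CLAUSE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)
SCORE_INSTRUCTION = "Score this answer..."  # real wording is elided in this entry


def build_completeness_prompt(rubric: str, reference: str, answer: str) -> str:
    """Join sections in order; the clause sits directly before the score step."""
    sections = [
        rubric,
        f"Reference answer:\n{reference}",
        f"System answer:\n{answer}",
        PARAPHRASE_RECENCY_CLAUSE,  # restated next to the decision point
        SCORE_INSTRUCTION,
    ]
    return "\n\n".join(sections)


prompt = build_completeness_prompt("<rubric body>", "<gold>", "<agent answer>")
answer_pos = prompt.index("<agent answer>")
clause_pos = prompt.index(PARAPHRASE_RECENCY_CLAUSE)
score_pos = prompt.index(SCORE_INSTRUCTION)
```

A regression test for this kind of change would assert the ordering `answer_pos < clause_pos < score_pos`, not just that the clause appears somewhere in the prompt.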
**Selected 5 disputed items** (representative of the gold=2 / Haiku=2 /
gpt=1 pattern across both corpora): `q006`, `q011`, `k8s_002`, `k8s_006`,
`k8s_018`. All five are pure paraphrase-coverage cases (the system
answer paraphrases the gold's points; Haiku scored 2; GPT-4o-mini scored
1 with the extraction-vs-reasoning split documented in
`measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`).
**Pre-committed 3A success criteria.**
- **Fixed (≥3/5 shift from 1 → 2):** Recency-positioning is sufficient.
Re-run GPT-4o-mini on the full 26-item completeness set (19 disputed)
with the corrected prompt, recompute κ, update the writeup table. Story:
"rubric-engineering matters more than judge model choice for ordinal
scales; recency-positioning the paraphrase instruction recovered N% of
disputed items." The completeness story becomes actionable, not
diagnostic-only.
- **Partially fixed (1–2/5 shift):** Inconclusive at N=5
(binomial-significance line is ~3+). Re-run on the full 26-item set
(~$0.20) to get a clean number; write up whatever the full-26 says.
- **Not fixed (0/5 shift):** The instruction is being received and
ignored; the model can't act on it under reasoning load. Escalate
to 4A (GPT-4o full on the same 5 items) to verify the small-model-
specific claim. Story: "repositioning the paraphrase instruction
adjacent to the score decision did not shift any of 5 disputed items;
GPT-4o handled the same prompts. The bias is small-model-specific,
not prompt-fixable."
The 3/5 threshold is the binomial-significance line at this N: random
shifting under the null produces 0 or 1 changes most of the time.
Pre-committing avoids the "2 shifted, that's kind of a fix" negotiation.
**On the 1A relevance finding: confirmed.** Both judges are essentially
correct on every relevance item (Haiku 29/30, GPT-4o-mini 30/30); κ
degeneracy is structural under 29/30 prevalence at class-2; AC1 +
raw agreement is the right reporting. No further investigation on
relevance. Writeup paragraph is one short sentence: prevalence-induced
degeneracy → AC1 is load-bearing.
## Plan 3A: outcome on the 5-item probe + full-26 re-run (v1.1.1)
**Date:** 2026-05-06. **Status:** complete; the v1.1.1 prompt is now
permanent in `agent_bench/evaluation/judges/completeness.py`.
**3A 5-item probe:** 3/5 disputed items shifted 1 → 2 (q006, q011,
k8s_002), 2/5 unchanged (k8s_006, k8s_018). Cost $0.0013. This meets the
pre-committed threshold (≥3/5 → "fixed"), so the protocol triggered the
full-26 re-run on gpt-4o-mini only (Haiku held as control to make the
v1.1 → v1.1.1 delta cleanly attributable to the intervention's effect on
the affected judge).
**Full-26 re-run (gpt-4o-mini completeness, v1.1.1 prompt):**
| Judge run                | n  | Raw agreement | Cohen's κ  | AC1        |
|--------------------------|----|---------------|------------|------------|
| v1.1 gpt-4o-mini         | 26 | 26.9%         | +0.020     | +0.006     |
| **v1.1.1 gpt-4o-mini**   | 28 | **42.9%**     | **+0.000** | **+0.232** |
| v1.1 Haiku (control)     | 26 | 84.6%         | +0.416     | +0.792     |
**Per-item delta (v1.1 → v1.1.1):** 7 items shifted up (all 1 → 2),
0 shifted down, 19 unchanged. Of the 7 up-shifts: 6 are correct (gold=2
items moving from pred=1 to pred=2: k8s_002, k8s_013, k8s_015, k8s_016,
k8s_017, q006), 1 is a regression (k8s_025: gold=1, was correctly pred=1
in v1.1, now over-credited at pred=2). Net per-item correctness delta:
+5 items.
**Cohen's κ is misleading on this comparison.** v1.1.1 raw agreement
rose from 26.9% to 42.9% (+16 percentage points), and AC1 rose from
0.006 to 0.232 (38× improvement). But Cohen's κ stayed at ~0, slightly
*lower* than v1.1's 0.020. The mechanism is prevalence-rebalancing in
the marginals: gpt-4o-mini's pred distribution shifted from `{0:2, 1:19,
2:5}` (concentrated at 1) to `{0:4, 1:12, 2:12}` (more balanced, closer
to gold's `{1:5, 2:23}` over n=28). Cohen's κ = `(P_o - P_e)/(1 - P_e)`;
when marginals become more diverse, P_e (chance agreement) rises in
lockstep with P_o (observed agreement), and κ deflates. AC1 uses
prevalence-robust chance correction (`P_e = (1/(q-1)) Σ_k pi_k(1-pi_k)`,
where `pi_k` is the mean of the two raters' marginal proportions for
class k and `q` is the number of classes) and reads the actual signal.
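Both v1.1.1 statistics can be recomputed from the marginals quoted above, which makes the divergence concrete. The sketch below uses exact fractions and assumes 12 of the 28 items agree (the count implied by 42.9%); the function names are illustrative and not taken from the repository.

```python
from fractions import Fraction

GOLD = {1: 5, 2: 23}          # gold marginals over n=28, as quoted above
PRED = {0: 4, 1: 12, 2: 12}   # gpt-4o-mini v1.1.1 marginals, as quoted above
N = 28
P_O = Fraction(12, N)         # assumption: 42.9% raw agreement = 12/28
CLASSES = (0, 1, 2)


def kappa_from_marginals(p_o, gold, pred, n):
    """Cohen's kappa with P_e = sum_k P(gold=k) * P(pred=k)."""
    p_e = sum(Fraction(gold.get(k, 0), n) * Fraction(pred.get(k, 0), n)
              for k in CLASSES)
    return (p_o - p_e) / (1 - p_e)


def ac1_from_marginals(p_o, gold, pred, n):
    """Gwet's AC1 with P_e = (1/(q-1)) * sum_k pi_k * (1 - pi_k)."""
    q = len(CLASSES)
    pi = [(Fraction(gold.get(k, 0), n) + Fraction(pred.get(k, 0), n)) / 2
          for k in CLASSES]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_o - p_e) / (1 - p_e)


kappa = kappa_from_marginals(P_O, GOLD, PRED, N)
ac1 = ac1_from_marginals(P_O, GOLD, PRED, N)
print(f"kappa={float(kappa):.3f} AC1={float(ac1):.3f}")  # kappa=0.000 AC1=0.232
```

Under these marginals Cohen's chance term works out to exactly 12/28, the same as observed agreement, so κ is exactly zero, while AC1's chance term is about 0.256 and leaves a visible signal.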
This is the same trap that motivated AC1 over κ on the relevance and
groundedness rows of the original κ table, surfacing here at a
different distribution boundary. The κ table footer already explains
why per-dimension metric selection matters; v1.1.1's outcome
demonstrates the trap *induced by the intervention itself*.
**Effect on the jury aggregate.** With κ-derived weights and
gpt-4o-mini's v1.1.1 κ at 0 (clipped from +0.000 to weight=0), the jury
verdict on completeness is now mathematically equivalent to Haiku-alone
on every item (gpt's contribution is multiplied by zero). Jury κ stays
at 0.416, identical to v1.1's corrected aggregate. The intervention's
per-member improvement is *invisible at the jury level* under this
weighting scheme.
**Methodological consequence: v1.2 fix-list addition.** The v1.2
fix-list now expands by one item:
5. **Prevalence-robust weights for prevalence-skewed dimensions.**
v1.1's `_compute_kappa_weights` uses Cohen's κ for every dimension,
which has a *self-defeating property* on prevalence-skewed gold:
improving a member can lower its weight even as it gets more
accurate.
**Mechanism.** Cohen's κ = `(P_o - P_e) / (1 - P_e)`, where
`P_e = Σ_k P(gold=k) × P(pred=k)` is the chance-agreement term
computed from the marginal distributions. P_e is *not* invariant to
the predictor's marginal distribution: when a member's predictions
become more diverse (less concentrated at one class), P_e *rises*
as the marginals approach gold's marginals. Concretely: when an
intervention moves a member's pred distribution from
concentrated-at-one-class toward gold's distribution, P_o and P_e rise
together in lockstep. The numerator `P_o - P_e` stays small, and κ
deflates even as raw accuracy improves. This is the same
prevalence-induced degeneracy that motivated AC1 over κ on
relevance/groundedness rows in the κ table; it surfaces in jury
weighting at any distribution-shifting intervention's boundary.
**Empirically observed in v1.1.1.** The recency-positioning
intervention shifted gpt-4o-mini completeness pred dist from
`{0:2, 1:19, 2:5}` to `{0:4, 1:12, 2:12}`, closer to gold's
`{1:5, 2:23}` over n=28. Per-cell raw agreement 26.9% → 42.9%.
AC1 (Gwet 2008) reads the change correctly: 0.006 → 0.232 (38×).
Cohen's κ stays at ~0 (0.020 → 0.000) because P_e is now ≈ P_o
≈ 0.43. v1.1's `_compute_kappa_weights` clips the new κ at zero,
producing weight = 0, and the jury aggregate loses access to a
member that was empirically improved. The intervention's
per-member improvement is invisible at the jury level under κ-weighting.
**Architectural decomposition for v1.2.** The right separation:
- **Per-dimension metric for κ table reporting** (already in v1.1
via `agent_bench/evaluation/calibration/report.py::_DIM_METRIC`).
- **Per-dimension weight metric for jury aggregation** (new in
v1.2, reuses `_DIM_METRIC`). Use κ where the gold's prevalence
supports it, AC1 where κ degenerates. Same lookup, same
per-dimension policy at both reporting and weighting layers.
- **Per-dimension membership as explicit configuration override**
for members that are structurally inappropriate (v1.2 fix #3,
unchanged), distinct from "low score on the chosen metric,"
which is handled by the weight floor.
**Why this is non-obvious.** A reader's first instinct is that
"weight by κ" is a sensible default, since κ is *the* standard
inter-rater statistic. The self-defeating property is invisible until
you observe a real intervention that shifts marginals; in static
conditions (no intervention, fixed prompts), the κ-weight choice
is benign. The v1.1.1 outcome is the first time the agent-bench
calibration set has produced an intervention-induced marginal
shift on the same gold; the failure mode wouldn't have been
visible in v1.0's static calibration sweep.
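The three-layer decomposition in item 5 can be illustrated with the v1.1.1 completeness figures reported in this entry. This is a sketch only: `member_weights` is a hypothetical helper, and applying the 0.2 floor (suggested in item 3 for κ) to AC1 as well is an assumption rather than a documented v1.2 decision.

```python
# Assumed floor; item 3 suggests 0.2 for kappa, reusing it for AC1 is a guess.
WEIGHT_FLOOR = 0.2


def member_weights(agreement, metric, excluded=frozenset(), floor=WEIGHT_FLOOR):
    """Return {member: weight} for one dimension.

    agreement: {member: {"kappa": float, "ac1": float}} measured against gold.
    metric:    statistic this dimension weights by ("kappa" or "ac1").
    excluded:  members removed by explicit configuration, whatever their score.
    """
    weights = {
        member: stats[metric]
        for member, stats in agreement.items()
        if member not in excluded and stats[metric] >= floor
    }
    if not weights:
        raise ValueError("no jury member left for this dimension")
    return weights


# v1.1.1 completeness figures reported in this entry.
completeness = {
    "haiku": {"kappa": 0.416, "ac1": 0.792},
    "gpt-4o-mini": {"kappa": 0.000, "ac1": 0.232},
}
by_kappa = member_weights(completeness, "kappa")
by_ac1 = member_weights(completeness, "ac1")
by_ac1_excluded = member_weights(completeness, "ac1", excluded={"gpt-4o-mini"})
```

Weighting by κ drops gpt-4o-mini entirely, weighting by AC1 keeps it at 0.232, and the explicit `excluded` override removes it regardless of which metric is chosen, which is the separation the decomposition argues for.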
**v1.1.1 code changes (this commit):**
- `agent_bench/evaluation/judges/completeness.py`: adds
`PARAPHRASE_RECENCY_CLAUSE` constant, inserted between the system
answer and the score instruction. Comment cites the 3A probe.
- `tests/evaluation/test_judges.py::TestCompletenessJudge::test_reference_answer_in_prompt`:
extends to assert the recency clause appears AND is positioned
between the answer and the score instruction (position is
load-bearing, not just lexical inclusion).
- `results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl`:
merged sidecar: v1.1 groundedness/relevance rows (unchanged
judges) + fresh v1.1.1 gpt-4o-mini completeness rows + v1.1 Haiku
completeness rows.
- `measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl`: the
5-item probe artifact with reasoning + evidence_quotes for each.
- `scripts/_dev/probe_3a_paraphrase_recency.py`,
`scripts/_dev/rerun_completeness_v1_1_1.py`: reproducers; not
part of the production calibration runner.
**No changes to the κ table.** The jury aggregate κ on completeness is
unchanged (0.416 → 0.416) because of the κ-as-weight degeneracy
described above; adding a `jury_kappa_weighted_v1_1_1` row with
identical numbers would be visual noise. The v1.1.1 finding lives in
the writeup body, not the table: the per-member AC1 improvement
(0.006 → 0.232) is the headline number, surfaced as a separate
paragraph next to the κ table rather than inside it.
**Total spend through Plan 3A:** $0.0013 (3A probe) + $0.0075 (full-26
re-run) = $0.0088.
## Plan 4A: GPT-4o (full) on the v1.1.1 residual
**Date:** 2026-05-06. **Status:** complete. Run after the
writeup-framing review surfaced that v1.1.1's "fixed" verdict was
overclaim-prone: 5/19 items were recovered, 14 remained unchanged and
uncharacterized. 4A was originally scoped as conditional on 3A *not*
being fixed (per the predefined sequencing rule), but became valuable
as a *post-3A* diagnostic to characterize the residual: is it
small-model-specific or rubric-under-specified?
**Scope.** GPT-4o (`gpt-4o-2024-08-06`) on 5 of the 14 v1.1.1-unchanged
items: `k8s_006`, `k8s_018`, `q011`, `q012`, `k8s_001`. Same v1.1.1
production prompt (paraphrase recency clause active). The first two
(k8s_006, k8s_018) are the items that didn't shift in the original 3A
5-item probe; we have gpt-4o-mini's reasoning on those items *with*
the v1.1.1 intervention, so 4A gives a clean A/B at fixed prompt
varying only the model. q011, q012, k8s_001 cover the broader
fastapi/k8s residual surface (k8s_001 is also a Haiku miscall; 4A
checks whether GPT-4o agrees with gold or with Haiku).
**Result: 5/5 correct.** All 5 items scored 2 by GPT-4o, matching gold
exactly. Cost: $0.0011 reported (caveat: pricing config falls back to
gpt-4o-mini rates for unlisted models, so actual cost is closer to
$0.005–0.01; the reported number under-reports by ~5–10×).
**Sharpened mechanism: criteria-invention, not just literal-match.**
The original 3-example artifact (q006, k8s_002, k8s_018) was framed
as gpt-4o-mini "applying a literal-string-match standard" while
correctly extracting paraphrased coverage into evidence_quotes. 4A's
side-by-side reasoning on `k8s_018` shows a distinct second mechanism:
- **gpt-4o-mini (v1.1.1, score 1):** "It mentions some key points
from the reference... but does not explicitly state that the new
fields in `autoscaling/v2` are preserved as annotations when using
`autoscaling/v1`, nor does it mention the need to use
`autoscaling/v2` directly for memory or custom metric scaling for
a Deployment or StatefulSet."
- **gpt-4o (4A, score 2):** "The answer covers all the key points
from the reference. It mentions that the current stable version is
autoscaling/v2, which supports scaling on memory and custom
metrics, similar to the reference. It also notes that
autoscaling/v1 only supports CPU-based scaling, aligning with the
reference's points."
The reference for k8s_018 specifies three points: (1) autoscaling/v2
is the current stable API, (2) it adds memory metrics support beyond
v1's CPU-only, (3) it adds custom metrics support. gpt-4o-mini's
reasoning step *invents additional criteria* the reference does not
require ("preserved as annotations when using autoscaling/v1," "use
autoscaling/v2 directly for ... a Deployment or StatefulSet") and then
deducts against them, scoring 1. GPT-4o reads the reference's three
points and scores against exactly those, scoring 2.
This is a *capacity* finding distinct from the paraphrase-recency
finding: gpt-4o-mini's reasoning, even with the v1.1.1 prompt directing
it toward paraphrase semantics, manufactures additional gold criteria
during scoring that aren't in the reference. Recency-positioning the
"paraphrase allowed" clause doesn't address this: the bias isn't
"missed paraphrase," it's "invented extra requirements." Two failure
modes were stacked; v1.1.1 fixed one; the second is what 4A surfaces.
**Implication for v1.2.** With 5/5 confirmed, v1.2 fix #3
(per-dimension membership) gets clean empirical support: gpt-4o-mini is
the wrong tool for 3-point completeness with paraphrase semantics, and
no amount of prompt engineering on this rubric is going to bridge the
capacity gap. The right v1.2 path is one of:
- **Exclude gpt-4o-mini from completeness scoring** (per-dim
membership; jury reduces to single-judge Haiku on completeness;
explicit and visible in config).
- **Replace gpt-4o-mini with GPT-4o on completeness** (per-dim
judge selection; jury keeps two members but the second is a
frontier-class model on the dimension that needs it).
Both are defensible v1.2 designs. The choice depends on cost
budget: gpt-4o is ~10× the per-call cost of gpt-4o-mini. For
agent-bench's calibration set scale (~30 items per row), even
gpt-4o is trivially cheap; for production deployment evaluating thousands
of agent outputs, the cost trade-off matters more.
**4A artifact:** `measurements/2026-05-06-4a-gpt4o-full-probe.jsonl`
(per-item reasoning + evidence_quotes for the 5 GPT-4o calls; pairs
with the v1.1 sidecar's gpt-4o-mini reasoning on the same items for
the side-by-side analysis above).
**Updated honest framing for the writeup.** "v1.1.1 addressed one
identified failure mode (paraphrase-instruction-loss across reasoning,
recovered 5/19 disputed items via positional change). 4A confirmed the
residual 14 are a distinct failure mode (capacity-limited criteria
invention during the reasoning step): GPT-4o handles all 5 sampled
residuals at the same v1.1.1 prompt, so the failure is
small-model-specific rather than rubric-limited. v1.2 fix #3 (per-dimension judge
membership / model selection) is the right escalation; the rubric
itself doesn't need changes."
**Total session spend:** $0.0099 reported (~$0.013–0.018 actual after
gpt-4o pricing correction).