Spaces:
Running
Design Decisions
Why build from primitives, not LangChain?
I wanted to demonstrate I understand tool dispatch, memory management, and retrieval orchestration at the implementation level. My provider abstraction is ~150 lines. When reranking or a second provider is needed, I know exactly where it plugs in — because I built every layer.
Why one provider in V1?
The interface supports multiple providers. V1 shipped OpenAI + Mock to prove the abstraction. V2 added Anthropic (claude-haiku-4-5), confirming that switching providers is a one-line config change. The orchestrator and tools are completely unchanged between providers.
Why one domain (technical docs)?
Clean text produces clean evaluation. Research papers introduce PDF parsing noise (tables, figures, formulas) that degrades eval quality without adding signal. The framework handles any text corpus — the domain is a config swap.
Why Reciprocal Rank Fusion, not score normalization?
BM25 scores and cosine similarities live on different scales.
Normalizing across scales is brittle: min-max normalization is
sensitive to outliers, z-score requires distribution assumptions.
RRF fuses by rank position 1/(k + rank), which is robust,
parameter-light (only k=60), and well-studied. Trade-off: loses
magnitude information, but for top-5 retrieval this rarely matters.
Why ~16 curated docs, not a large corpus?
Small corpus produces predictable retrieval, reproducible benchmarks, and easy debugging. Golden dataset questions map to specific source files. A reviewer can verify any result by reading the source. The framework scales to larger corpora — the choice is about evaluation quality, not capability.
Why no reranker in V1?
Feature-flagged in config (rag.reranker.enabled: false). V1
benchmarks without reranking establish an honest baseline. V2 adds
cross-encoder reranking and shows the delta.
Why no delete in the vector store?
FAISS flat index doesn't support efficient deletion. For a small corpus that changes rarely, rebuild-on-ingest is simpler and eliminates consistency bugs.
Why async internals, sync user behavior?
FastAPI and the OpenAI SDK are async-native. Using async for I/O
avoids blocking the event loop. V2 added SSE streaming (/ask/stream)
for the final synthesis step — tool calls remain non-streamed since
they complete in ~100ms.
Why SQLite-backed conversation sessions
V1 was stateless by design — no conversation_id, no cross-request
memory. V2 adds optional SQLite-backed sessions: pass session_id
on /ask to persist and load conversation history. When omitted,
behavior is identical to V1 (stateless). See the dedicated
DECISIONS.md entry under "Why SQLite for conversation persistence"
for the full rationale.
Why negative evaluation cases?
A system that always answers sounds confident but may hallucinate. 5 out-of-scope questions test whether the system refuses gracefully when the corpus doesn't contain the answer. Grounded refusal requires both refusal language AND zero cited sources — an answer that says "not found" but still cites docs is not a valid refusal.
Why deterministic eval + optional LLM judge?
CI needs free, deterministic tests. Retrieval P@5, keyword hit rate, citation accuracy, and grounded refusal rate run without API keys. LLM-judged faithfulness and correctness are manual enrichment steps, run locally, with results committed to the benchmark report.
Why structlog, not LangSmith/Langfuse?
Third-party observability contradicts the "built from primitives"
narrative. structlog provides JSON-structured logs, /metrics
exposes latency + cost. If a team uses LangSmith, adding it is
a one-day task.
Why explicit citation format [source: filename.md]?
The system prompt mandates inline citations so the evaluation
harness can parse them with \[source:\s*(.+?)\] and check
against the structured sources list. This catches hallucinated
citations and measures citation accuracy as a metric.
Why per-request retrieval settings via kwargs, not singleton mutation?
The orchestrator is a singleton shared across concurrent requests.
Storing top_k / strategy on self causes cross-request state
bleed. Instead, these are passed as local variables through the
tool execution kwargs — no shared state is mutated.
Why a relevance threshold for grounded refusal
V1 never refuses — it always retrieves tangentially related content and synthesizes an answer. This is a trust failure: users cannot distinguish "the system found relevant information" from "the system fabricated from vaguely related chunks." Grounded refusal rate was 0/5.
We add a refusal gate in SearchTool.execute() based on the maximum RRF
score across retrieved chunks. If no chunk scores above the threshold, the
tool returns "No relevant documents found" — the LLM then refuses via the
system prompt rather than fabricating from irrelevant content.
Gate location: The gate fires in SearchTool.execute(), not the
orchestrator. SearchTool is where retrieval scores are still available —
they are dropped before results reach the orchestrator. This also keeps
the orchestrator unchanged.
Threshold value: rag.refusal_threshold: 0.02 is a provisional default
pending an empirical sweep across the evaluation set. The sweep will test
values 0.01–0.03 and select the value that maximizes refusal on out-of-scope
queries without degrading in-scope P@5 and R@5. The actual RRF score
distribution will be documented here after tuning.
Interaction with reranking: The refusal gate fires on RRF scores BEFORE reranking. It is a go/no-go decision, not a per-chunk filter. If the gate passes, the full candidate set proceeds to the reranker. This keeps the threshold calibration independent of whether reranking is enabled.
Default disabled: refusal_threshold: 0.0 preserves V1 behavior exactly.
The feature is opt-in until the threshold is tuned.
Alternative considered: LLM-based relevance judgment ("is this content relevant to the query?"). Rejected because it adds latency, cost, and a second point of failure. The score-based approach is deterministic, fast, and debuggable.
Why cross-encoder reranking improves precision
BM25 retrieves lexically similar but semantically irrelevant chunks. RRF fusion mitigates this partially, but noisy BM25 results still dilute the top-5 set. P@5 was 0.70 in V1.
A cross-encoder (ms-marco-MiniLM-L-6-v2, ~80MB) scores each
(query, chunk) pair jointly, capturing semantic relevance that
bi-encoder similarity misses. The tradeoff is ~100–200ms extra latency
per query — acceptable given our 4.7s baseline is dominated by LLM
generation, not retrieval.
The reranker is enabled by default. Setting rag.reranker.enabled: false
restores V1 behavior exactly. reranker.top_k is independent of
retrieval.top_k, so the reranker's output count can be tuned without
affecting the RRF candidate pool.
The retriever passes all RRF-fused candidates to the reranker rather
than a computed subset. The reranker's top_k handles truncation.
This is simpler and more robust than computing an input size from
per-system candidate counts.
Why provider retry with exponential backoff
OpenAI returns 429 (rate limit) errors under load. Without retry logic,
a single 429 causes a user-visible failure. We add exponential backoff:
attempt after 1s, 2s, 4s. After 3 retries, raise ProviderRateLimitError
so the middleware returns a clear 503.
The retry wraps the raw openai.RateLimitError — it must fire BEFORE
the error gets translated to ProviderRateLimitError, otherwise retry
logic is dead code. Other errors (400, 401, timeout) fail immediately.
Why in-memory API rate limiting
A public-facing API needs abuse protection. We use a simple in-memory sliding window limiter: 10 requests/minute per IP. Sufficient for a demo deployment; a production system would use Redis.
Known limitation: the per-IP dict grows without bound across distinct IPs. Acceptable for Fly.io with auto-stop (memory resets). If running continuously under bot traffic, add a periodic sweep or switch to a TTL-based structure.
Design choices:
/healthand/metricsexempt: monitoring should never be rate-limited.Retry-Afterheader: follows HTTP 429 spec, lets clients back off.
Why SQLite for conversation persistence
Three options considered:
- In-memory dict: Lost on restart.
- SQLite: Zero-dependency, file-based, survives restarts.
- Redis/PostgreSQL: Adds infrastructure complexity.
SQLite is right for this scale. session_id is optional — when omitted,
the system behaves identically to V1 (stateless). This preserves backward
compatibility and keeps benchmark evaluation deterministic.
The route handler manages session state (load history, store Q+A), not
the orchestrator. The orchestrator accepts an optional history parameter
but has no knowledge of persistence. This keeps the agent loop testable
without a database.
Note: On HF Spaces, SQLite is ephemeral (no persistent storage on free tier). For the demo this is acceptable — sessions last until the container sleeps. Production would use a volume or managed database.
Why a second provider (Anthropic)
The provider abstraction existed since V1 but only had OpenAI + Mock. Adding Anthropic proves the abstraction works across fundamentally different APIs:
- System message:
system=parameter, not in the messages list - Tool definitions:
input_schemainstead ofparameters - Tool results:
tool_resultcontent blocks in user messages - Tool calls:
tool_usecontent blocks, not a separate field - Stop reason:
tool_usevsstop
The implementation is a config swap — provider.default: anthropic in
YAML switches the entire system to Claude. The orchestrator, tools,
evaluation harness, and serving layer are completely unchanged.
Same retry/timeout handling as OpenAI. Both providers are tested with mocked HTTP responses — no API keys needed in CI.
Why ranked_sources separate from deduplicated sources?
The deduplicated sources list in AgentResponse is for the API
response. The ranked_sources list preserves rank order with
duplicates for evaluation metrics. P@5 and R@5 need the raw
retrieval ranking, not the post-processed answer metadata.
Why vLLM over TGI / llama.cpp
vLLM has the widest model support, best throughput via PagedAttention, and a native
OpenAI-compatible server (/v1/chat/completions). TGI is a valid alternative; llama.cpp
targets different use cases (edge/CPU inference). This is a deliberate choice, not
ignorance of alternatives.
Why Modal for GPU inference
Serverless GPU eliminates idle cost and GPU node management. A10G at ~$1.30/hr costs
~$0.50 per full 27-question benchmark run. The Docker Compose path (docker-compose.vllm.yml)
is retained for users who have local GPUs or prefer persistent serving.
Why split topology (K8s API + Modal GPU)
The API layer (retrieval, orchestration, tool routing) is CPU-bound and benefits from horizontal scaling via K8s HPA. The LLM inference layer is GPU-bound and benefits from serverless elasticity — Modal scales to zero when idle, scales up on demand with no node provisioning. Co-locating both in K8s would require GPU node pools with idle cost, node autoscaler latency, and NVIDIA device plugin management. This mirrors a common production pattern.
Why Helm only, not Kustomize + Helm
Showing two K8s deployment methods for the same app adds complexity without demonstrating
distinct skills. Helm with values-dev.yaml / values-prod.yaml covers
environment-specific configuration cleanly.
Why CPU-based HPA, not custom metrics
CPU utilization works without a Prometheus adapter or custom metrics server. A production
improvement would use the Prometheus adapter to scale on p95 latency from the /metrics
endpoint — this requires bridging the JSON metrics to Prometheus exposition format.
Documented as a follow-up.
Why env var fallback in SelfHostedProvider
Follows the same pattern as OpenAIProvider reading OPENAI_API_KEY. The YAML config
provides defaults; env vars override at runtime. No config loader changes needed.
Why lazy tool-call detection, not metadata check
Checking /v1/models metadata for tool-calling support is unreliable — model metadata
doesn't consistently report this capability. Instead, the provider sends one tool-calling
request on first complete() call with tools and checks if the response contains
tool_calls. The result is cached as self._supports_tool_calling. Transient failures
(timeout, 5xx) return None and retry on the next call rather than permanently
downgrading to prompt-based fallback.
Why two-tier injection detection, not three
The original design included a middle tier (embedding similarity against known injection examples). Dropped because the existing embedding model (all-MiniLM-L6-v2) is a general-purpose sentence encoder, not specialized for adversarial detection. Cosine similarity can't distinguish semantic similarity from intent similarity — "how do I ignore a field in Pydantic?" clusters near "ignore previous instructions" in that embedding space. The threshold between "ambiguous" and "suspicious" is an untunable hyperparameter with no ground truth.
Two tiers are cleaner: heuristic regex is deterministic (matches or doesn't), DeBERTa classifier is probabilistic (confidence score). No ambiguous handoff between two probabilistic layers. Deployments without GPU get heuristic-only — documented, not hidden.
Why regex + optional spaCy for PII, not a cloud API
Three reasons: cost (cloud PII APIs charge per call), latency (adds network round-trip to every retrieved chunk), and data residency (PII leaves the system boundary). Regex covers the PII types with actual legal/compliance risk: SSNs, credit cards, emails, phone numbers, IP addresses.
spaCy NER (PERSON, ORG) is optional because false-positive rates on technical text are unacceptable without domain tuning. "FastAPI" triggers ORG, "Jordan" triggers PERSON. The optional import pattern (try: import spacy) degrades gracefully with a logged warning — no crash if someone sets use_ner: true without installing spaCy.
Why append-only JSONL for audit, not SQLite
One codepath, one format, no config branching. JSONL is append-only by nature — no schema migrations, no transactions, no connection pooling. Log rotation handles size. jq provides immediate queryability without building a custom API.
The original design included an optional SQLite backend and a query endpoint (GET /admin/audit). Both were dropped: SQLite adds a second storage codepath with no consumer, and the query endpoint would require API key authentication — an inconsistency when /ask itself has no auth.
JSONL imports trivially into SQLite/DuckDB if structured queries are needed later. No bridges burned.
Why HMAC-SHA256 IP hashing in audit logs
HMAC-SHA256 with a server secret hashes client IPs before logging. Plain SHA-256 was considered but rejected: the IPv4 address space (~4.3 billion) is small enough that unsalted hashes are reversible by offline enumeration. HMAC-SHA256 with a secret key makes precomputation infeasible without the key. The key is sourced from an explicit parameter, AUDIT_HMAC_KEY env var, or (with a logged warning) a random per-process fallback.
Why three output validators, not four
The original design included a "length/format sanity check" (reject suspiciously short responses or raw JSON in natural-language context). Dropped because the calculator tool returns short numeric answers and the tech docs domain legitimately contains code blocks and JSON examples. Every false positive erodes trust in the validation layer. The three remaining checks — PII leakage, URL hallucination, blocklist — are deterministic with clear pass/fail semantics.
Why buffer-then-validate for streaming output
The /ask/stream endpoint buffers all events from the orchestrator before sending to the client, then validates the assembled answer. This means the client waits for the full answer before receiving any content chunks. The orchestrator emits the final synthesis as a single chunk (tool-use iterations are not streamed), so the buffering adds no perceptible latency. The alternative — streaming chunks immediately and appending a safety marker — leaks unsafe content to any client that stops reading after the done event.
Why no authentication on API endpoints
The HF Spaces demo is public by design — the curl examples in the README work without credentials, which is the point. Adding API key authentication would gate access but break the zero-friction demo experience that makes the project evaluable.
The security pipeline protects content (injection detection, PII redaction, output validation), not access. This is a deliberate scope boundary: application-layer guardrails ensure the system behaves safely regardless of who calls it, rather than assuming trusted callers. Rate limiting (10 RPM per IP) provides basic abuse protection.
A production deployment would add authentication (API keys or OAuth) at the infrastructure layer — reverse proxy, API gateway, or middleware. The security pipeline's getattr(..., None) pattern means auth can be layered on without modifying the existing security components.
Why monitor mode for output validation, not gating?
Output validation runs post-stream as a monitoring layer. The answer streams to the client, then validation runs and emits its verdict. Gating (buffer-then-validate) would add 4-5 seconds of dead air while the full answer generates — unacceptable streaming UX for a documentation Q&A bot. Trade-off: a hallucinated URL or PII fragment could reach the client before validation catches it. For this use case (FastAPI docs, no real PII in corpus), the risk is near-zero. The dashboard labels this "monitored" (not "gated") to be explicit about the posture.
Why named residual risks and scope limits, not "fully mitigated" verdicts?
The OWASP LLM Top 10 (2025) mapping could have been written as a 10-row table where LLM01 and LLM02 read as "addressed" without qualifiers — shorter and cleaner-looking. Rejected because OWASP's own 2025 text is explicit about what an input guardrail can and cannot do, and writing a verdict that contradicts the source the mapping cites would be compliance theater.
LLM01 Prompt Injection — OWASP 2025 states that RAG and fine-tuning do not fully mitigate prompt injection, and that indirect injection through retrieved content remains a core risk class. "Fully mitigated" is unsupportable for any system retrieving untrusted content into an LLM context window, which is every RAG system including this one. The LLM01 verdict reads "addressed directly with named residual risk"; the residual-risk cell cites OWASP's own "do not fully mitigate" language verbatim.
LLM02 Sensitive Information Disclosure — OWASP 2025's LLM02 mitigations span four concern classes: access controls, training-data handling, user-consent transparency, and proprietary-information governance. This implementation addresses a narrower output-side subset (output validation for PII leakage, secret formats, and URL hallucination) — not cleanly one of the four concern classes, but a narrower scope than any of them. The verdict reads "addressed directly for the applicable scope"; the scope-limit cell enumerates the four concern classes verbatim and names what addressing the broader concerns would require (multi-tenant or authenticated architecture).
The tension the entry resolves is honesty-vs-scannability: a mapping that surfaces named residual risks and scope limits is longer and harder to skim than one with uniform "addressed" verdicts, but the scannable version over-claims relative to the cited source. Honest evaluation is the brand. Every verdict cell in SECURITY.md must survive a reviewer reading OWASP 2025 in a second tab.
See SECURITY.md § LLM01 Prompt Injection and § LLM02 Sensitive Information Disclosure for the verdict cells; this entry covers why the verdict discipline takes the form it does. The LLM01 "do not fully mitigate" phrasing and the LLM02 four-concern-class enumeration are canonical in SECURITY.md; the README tail and landing-page subtitle paraphrase but must preserve the named-residual-risk and scope-limit structure.
Why additive SSE stage events?
The enhanced /ask/stream adds meta and stage event types alongside
the existing sources, chunk, and done events. Existing consumers
that only handle the three legacy types are unaffected — they simply
ignore events with unknown types. This avoids versioning the endpoint
or breaking the non-streaming /ask contract. The meta event fires
first (before any stages) so the frontend can display provider/model
info immediately.
Why vanilla JS for the frontend, not Alpine or React?
The showcase dashboard has ~5 pieces of reactive state (pipeline stages,
retrieval results, security badges, stats, chat messages). The SSE
handler is inherently imperative: receive event, querySelector the
target node, update classList and textContent. Wrapping this in a
reactive framework adds a dependency, interview questions about
"why is there a framework for 5 state variables", and indirection
that fights the imperative SSE pattern. One state object + a few
render() functions handles it in ~150 lines.
Phase 1 SSE gate closure — two baselines on record, not one
The Phase 1 acceptance gate for the SSE backend work (meta event,
stage events, iteration-aware metadata threading, route-level
injection/output-validation events) requires re-running
make evaluate-fast and confirming numbers match pre-change state
on the pinned gpt-4o-mini-2024-07-18 snapshot. The re-run was
honored literally rather than substituted with a git-diff
argument, even though the SSE commits did not touch
scripts/evaluate.py's legacy code path. Two reasons: the
re-commitment discipline that kept Fix 1 and Fix 2 honest applies
equally here, and the legacy path and the --corpus fastapi path
produce materially different baselines that cannot substitute for
each other.
Two distinct baselines now exist at the pinned snapshot, and both are on record — one per prompt path:
| Baseline file | Invocation | Prompt source | In-scope P@5 | In-scope R@5 | Citation | Mean calls |
|---|---|---|---|---|---|---|
results/fastapi_preedit.json @ 213da36 |
--corpus fastapi |
format_system_prompt("FastAPI") |
0.718 | 0.833 | 1.000 | 1.14 |
results/fastapi_legacy_baseline_pinned.json @ this commit |
make evaluate-fast (no --corpus) |
tech_docs.yaml task.system_prompt |
0.655 | 0.849 | 1.000 | 1.45 |
Citation accuracy holds at 1.000 on both paths, both in-scope and
out-of-scope. The retrieval metric deltas (P@5 −0.063, R@5 +0.016,
KHR +0.045) and behavioral delta (mean tool calls +0.318 in-scope,
+1.00 out-of-scope) trace to the prompt-path divergence
(scripts/evaluate.py:67 reads task.system_prompt in the legacy
branch vs. format_system_prompt(label) in the --corpus branch),
not to any change in retrieval, reranking, or refusal-gate code.
This divergence is the same one the "evaluation-layer multi-corpus
support lagged the serving-layer refactor" entry documents; the
narrowed serving-migration deferral tracks its eventual migration.
Why both baselines are retained. When the serving-migration
deferral lands and scripts/evaluate.py's legacy branch is removed
(everything routes through --corpus fastapi), the regression gate
is "post-migration make evaluate-fast output matches pre-migration
--corpus fastapi output within pre-committed tolerances." That
gate requires the --corpus fastapi baseline as the comparison
reference AND the legacy baseline as evidence of the pre-migration
state that is being retired. Retaining both makes the migration
auditable and bounds its regression budget; retaining only one
would force the post-migration run to compare against a baseline
from a different prompt path, guaranteeing the gate fires on
prompt divergence rather than on any actual regression.
Gate verdict: passed. No regression vs pre-SSE legacy path expectations (citation 1.000 holds, refusal gate fires on the same 5 out-of-scope questions, retrieval numbers in sane in-scope ranges). Phase 1 SSE backend work is closed from the backend side; the frontend's consumption of iteration-aware stage events is orthogonal and owned by Week 1 step 7 (showcase UI).
Why per-corpus refusal thresholds?
FastAPI and Kubernetes have different corpus characteristics. FastAPI has 16 short, well-structured docs with sparse cross-references — relevance tends to concentrate in 1-2 chunks per query. Kubernetes has 30-40 docs with heavy cross-referencing between concepts (Pod → Deployment → Service → Ingress), which spreads relevance across more chunks. A single global refusal threshold would either refuse too aggressively on K8s (no single chunk dominates, so the top score looks "low") or not aggressively enough on FastAPI (where a moderate-scoring chunk might be the only hit and should still refuse).
CorpusConfig carries refusal_threshold as a per-corpus field.
Each threshold gets tuned against its own golden dataset — there
is no "fair" shared threshold because BEIR showed these are not
comparable across corpora. Placeholder values ship in default.yaml
and are replaced by tuned values during the per-corpus evaluation
sweep.
Why corpus and provider toggles compose — corpus_map[corpus][provider]
The simpler design would have been corpus_map[corpus] returning a
single orchestrator. It ships in 10 fewer lines. It also silently
breaks the provider toggle in multi-corpus mode: the orchestrator
inside each corpus cell holds one fixed provider, and clicking
"Anthropic" in the dashboard keeps running on OpenAI.
This project's hero-tile metric is the provider comparison (1.00 API / 0.14 7B self-hosted). Breaking the mechanism that demonstrates that
metric — on a portfolio demo where a reviewer will open DevTools and
notice — would erode the honest-evaluation brand the whole repo is
built around. The nested corpus_map[corpus][provider] structure
keeps both toggles functional. Store, retriever, and search tool are
shared across providers within a corpus (the expensive objects are
held once per corpus); only the orchestrator varies per provider
since it holds the LLM client. Per-corpus × per-provider memory
overhead is an orchestrator struct, not a FAISS index.
RSS is logged per corpus, not per corpus × provider, because the store is what drives memory. The provider multiplier is negligible compared to a hybrid index + embedder.
Why one parameterized system prompt, not per-corpus templates
The template is "You are a technical documentation assistant for {corpus_label}...". The only corpus-specific element is the label;
prompt content is identical across corpora: same citation format,
same refusal language, same grounding instructions. Having two
separate prompt files would invite drift — someone tweaks the FastAPI
prompt for a specific failure mode and forgets to update the K8s
version, and the demo silently answers differently on the two toggles.
The parameterization is enforced by two tests: (a)
format_system_prompt("") raises ValueError so an unresolved
{corpus_label} can never reach the LLM, and (b) a spy on
orchestrator.run_stream asserts FastAPI and K8s requests receive
different prompts with the correct label substituted.
The wording deliberately differs from the typical "don't hallucinate" RAG template:
- "refuse the question explicitly" matches our refusal-gate mechanism. "Say so politely" is soft language that models interpret as "hedge and answer anyway".
- "do not infer, do not extrapolate, do not draw on general knowledge" is the three-verb prohibition. "Do not fabricate" is empirically easier to slip past because models distinguish fabrication (making things up) from extrapolation (drawing conclusions from adjacent but non-authoritative context).
Why Kubernetes curation targets recruiter-likely questions, not coverage
The K8s corpus targets ~30-40 pages curated around concepts a technical reviewer would naturally type (Pod, Deployment, Service, Ingress, ConfigMap, RBAC) plus cross-referencing overview pages that stress the reranker. Cluster administration deep-dives, tutorials, and kubectl reference are explicitly excluded — they add noise without adding reviewer value and hurt retrieval precision when adjacent content is thin on concept definitions.
data/k8s_docs/SOURCES.md is a version-controlled curation artifact.
Each ingested URL has a one-line rationale, a date pulled, and a
license note. This makes the corpus reproducible and documents the
curation reasoning for any reviewer who looks closely.
Trade-off: the corpus is not comprehensive K8s knowledge. A question about etcd raft internals will be correctly refused. This is not a bug — the refusal is part of the demo story, and "the system knows what it doesn't know" is a feature of the grounded-refusal mechanism.
Why no cross-corpus score comparison (inspired by BEIR)
Inspired by BEIR's heterogeneous-benchmark framing (Thakur et al., NeurIPS 2021), which spans 18 datasets across 9 task types, absolute retrieval scores are not treated as comparable across FastAPI and K8s corpora — score distributions depend on chunk length, vocabulary overlap, and corpus density, none of which are held constant across domains. Only rank-ordering of system configurations within a single corpus is meaningful. Concrete consequences for this repo:
- Per-corpus evaluation results are reported separately, never aggregated into a single "combined" number.
- The hero-tile citation accuracy (
1.00 API / 0.14 7B self-hosted) stays FastAPI-specific. It is not restated as a cross-corpus average. make evaluate-fastaccepts a--corpusflag but has no "combined" mode. Anyone who wants a cross-corpus number has to run twice and acknowledge the incomparability in prose.- The landing page "Key Findings" cards avoid sentences that compare FastAPI and K8s numbers directly.
The multi-corpus demo is a surface feature for interactive exploration, not a rebenchmark. The benchmark section of the README remains FastAPI-only and cites 27 questions on 16 docs with specific chunker settings.
K8s golden dataset uses CRAG's 8-type taxonomy as the schema
The K8s golden dataset uses CRAG's 8-type taxonomy (Yang et al.,
NeurIPS 2024) as the schema for question_type, not as a
requirement to cover all 8 types. CRAG's taxonomy: simple,
simple_w_condition, set, comparison, aggregation,
multi_hop, post_processing_heavy, false_premise. Temporal
dynamism is a separate orthogonal property captured as
time_sensitive: bool on the question schema — it is not a CRAG
category.
Target distribution across the 25-question K8s golden set:
simple(5–6): baseline retrievalsimple_w_condition(3–4): nuanced understanding under conditionscomparison(3–4): retrieval across concept pages, reranker stressmulti_hop(5–6): synthesis across 2–4 docs, reranker stressfalse_premise(3–4): grounded refusal mechanismset/aggregation/post_processing_heavy(0–3): included only where corpus content naturally supports
time_sensitive: bool flags 2–3 questions targeting version-bounded
content (feature state, deprecations, API version migration).
false_premise questions come in two flavors (see separate
"False-premise questions come in two flavors" entry): pure refusal
(flavor A) and documented negative (flavor B). The K8s set includes
at least one of each. Flavor A tests the path where retrieval
correctly returns nothing useful; flavor B tests the path where the
corpus contains an explicit negative answer and the agent must
surface it with citation rather than confabulating a positive.
Rationale for using CRAG as schema (not coverage requirement):
false_premise and time_sensitive stress grounded refusal and
reduce test-set contamination risk; multi_hop and comparison
stress the reranker because relevance spreads across multiple
chunks. The distribution was chosen to exercise the parts of the
pipeline the benchmark story claims — not to mimic a general-purpose
QA benchmark.
The golden dataset JSON schema (v2, backward-compatible with the FastAPI flat list) includes:
source_chunk_ids: list[str]for multi-hop partial credit (answer must cite at least one of the expected chunks)source_snippets: list[str]for human-readable context during reviewquestion_type: str(CRAG taxonomy value)is_multi_hop: boolfor filtered reporting- Dataset-level header with
corpus,version,snapshot_date, and pinnedchunkerparameters so the dataset is reproducible against a specific K8s docs snapshot
See docs/plans/2026-04-12-multi-corpus-refactor-design.md for the
full schema and rationale.
EU AI Act corpus deferred to v1.2
EU AI Act compliance mapping is deferred to v1.2. Rationale: v1 ships two corpora (FastAPI, K8s) to demonstrate the multi-corpus architecture; EU AI Act as a third corpus would add ingestion and golden-set work without exercising architecturally new surface. Scoped as the first v1.2 addition after v1 launch.
Cold-start contingency: measure first, lazy-load if needed
Loading two corpora at startup costs memory and cold-start time. On HF Spaces (target deployment), the realistic ceiling is 8-10 GB resident RAM and ~60 seconds cold-start before the demo feels broken.
Policy:
- Measure HF Spaces cold-start on Day 1 of deployment.
- If cold-start < 60 s: plan validated, no changes.
- If cold-start > 60 s: implement a lazy-load path (FastAPI eager, K8s lazy on first K8s request). Scoped ~2 hours implementation.
This contingency is not pre-built. Pre-building a lazy-load path
that may never ship creates dead code that rots, and the test surface
for "lazy loading plus corpus routing plus provider switching" is
non-trivial. The RSS logging in app.py (Task 2) emits the exact
numbers needed to make the decision; the decision is documented here
so future-me remembers the threshold and doesn't optimize prematurely
on a hunch.
Cold-start gate fired — assumption falsified, fix deferred to v1.1 at the right cause
The preceding "Cold-start contingency" entry pre-committed a lazy-load fix (FastAPI eager, K8s lazy on first request) if the measured cold start exceeded 60 seconds. Measurement falsified the entry's core assumption: corpus loading is not the dominant cold-start cost. The committed fix addresses ~1 % of the observed overshoot. Executing it verbatim would honor the gate's letter but not its intent — theater dressed as discipline. This entry documents the measurement, the falsified assumption, and the new contingency pre-committed at the actual cause.
Measurement (N=3, 2026-04-15, HF Spaces target deployment):
| Sample | Cold start | Silent Python init | Visible phase |
|---|---|---|---|
| N=1 | 113 s | ~101 s | ~12 s |
| N=2 | 89 s | ~70 s | ~19 s |
| N=3 | 129 s | ~115 s | ~14 s |
- Median 113 s, mean ~110 s, range 89–129 s (spread ~40 s)
- Gate fire is unambiguous at both tails. Even the fastest sample (89 s) is ~48 % over the 60 s threshold; the slowest (129 s) is ~115 % over. No boundary ambiguity.
- Sample-size justification. N=3 is acknowledged as a small sample. It is adequate here because (a) the gate-fire conclusion is stable across both tails, (b) the "silent Python init dominates variance" finding is stable across all three samples (silent phase varies 70 → 115 s across runs; visible phase varies only 12 → 19 s), and (c) the cost of additional samples (manual HF Space restart + ~2 min wait + log extraction per sample) exceeds the marginal information gain once both tails fire the gate and the variance pattern is stable. N=4 would tighten the confidence interval on the median but does not change either the gate-fire conclusion or the falsified-assumption finding.
- Variance source named. HF Spaces shared-infrastructure CPU / IO contention during Python module imports. The silent-init phase varies 45 s across samples (70 → 115 s); the visible phase is stable (12–19 s). That is the signature of host-level contention on a shared physical node, not code-level variability. An exclusively-owned container would plausibly show a tighter bound.
- Raw log captures (preserved so this entry can be cross-checked
against the underlying evidence without re-running the measurement):
measurements/2026-04-15-coldstart-n1.log,-n2.log,-n3.log.
Where the cost lives. At the median (113 s):
- Silent Python init phase — ~90 s (≈ 80 % of total): interpreter
start, module imports (
torch,transformers,langchain,faiss,fastapi,httpx, the full dependency closure), and initial model weight loading (all-MiniLM-L6-v2embedder, cross-encoder reranker). Not logged — no observability inside the import chain. - Visible startup phase — ~15 s (≈ 15 % of total): injection
classifier init (
10 s, includes the "classifier skipped" warning), FastAPI corpus load (< 1 s, +0.9 MB RSS), K8s corpus load (< 1 s, +25.8 MB RSS), reranker warmup (2 s).
The K8s corpus load — which the pre-committed fix was designed to defer — contributes under 1 second of the 113-second median. Deferring it saves roughly 1 % of the overshoot. FastAPI corpus load is the same order of magnitude. Corpus loading is simply not where the cost lives on this deployment.
Why we are not executing the pre-committed fix. The preceding contingency was written under an empirical assumption about cost attribution (corpus loading is the dominant cost). Measurement falsified the assumption. Implementing the fix anyway would be a mechanical execution of a recipe whose premise has been disproven — it checks the gate-honoring box while failing to address the cause. That is structurally identical to relaxing-by-redefinition ("60 s was too tight"), just in the opposite direction: relaxing by execution. The pre-commitment rule's purpose is to prevent motivated reasoning about the gate, not to mandate mechanical compliance with a recipe whose empirical foundation has collapsed.
The honest action is (1) accept the measurement as the v1 baseline, (2) document the falsified assumption explicitly (this entry), (3) re-pre-commit a new contingency at the actual dominant cost with an explicit trigger condition so the decision is not relitigated at review time, and (4) update the user-facing README surface to reflect the measured cold-wake number rather than the optimistic pre-deploy estimate.
v1.1 contingency — pre-committed:
If HF Spaces traffic produces more than N cold wakes per day (N to be determined from observed usage patterns after launch, not estimated in advance), defer eager loading of (a) the cross-encoder reranker, (b) the sentence-transformers embedder, and (c) the injection classifier tier to first-relevant-request.
Estimated work: 4–6 hours (lazy-init wrappers + first-request caching + integration tests for the warm/cold transition).
Expected tradeoff: cold wake ~113 s → ~50–60 s (approaches the original 60 s target); first request after any cold wake incurs +8–15 s additional latency (model weights load synchronously in the request path), after which subsequent warm requests return to normal ~5 s latency.
Trigger is usage-justified, not estimate-justified. Until real traffic data justifies the work, there is nothing to optimize — a recruiter demo that gets one cold wake per day does not pay for 4–6 hours of engineering plus the new first-request-latency failure mode. The trigger threshold N is left unnamed deliberately: naming a number in advance would invite the same falsification pattern this entry is documenting.
Methodology lesson. When a pre-committed contingency is written under an empirical assumption, the contingency only holds if the assumption survives measurement. If measurement falsifies the assumption, the correct action is to document the falsification, accept the observed baseline, and re-pre-commit at the actual cause. The wrong action is to execute the original recipe anyway, which trades one form of motivated reasoning (threshold relaxation) for another (recipe compliance). The underlying discipline — "pre-commit your gates and honor them" — does not mean "mechanically run the pre-committed fix regardless of what it addresses." It means "honor the gate's intent, which is to prevent motivated reasoning about pass/fail."
Post-hoc refinement (2026-04-22) — three latency regimes observed. Follow-up warm-latency measurements after the initial cold-start characterization distinguish a middle regime that the README's "~2 min cold / ~5 s warm" phrasing collapses:
| Regime | Latency | Cause |
|---|---|---|
| Cold-start (container spin-up) | 89–129 s | Python imports, model load (per the table above) |
Wake-from-idle (first /ask after container is up) |
~6.7 s (≈ +2 s over warm) | Residual lazy init on first request |
| Steady-state warm | ~5 s (mean 4.74 s, n=5) | LLM API round-trip + retrieval |
Source: n=1 first-hit-after-cold-wake @ 6.7 s, then n=5 steady-state warm @ 4.07, 4.69, 4.78, 5.06, 5.11 s (2026-04-22). The README claim captures regimes 1 and 3; the ~2 s first-hit penalty between them is a refinement of the warm-cold boundary, not a contradiction. The v1.1 projection of "+8–15 s first request after any cold wake" measures the delta from this ~2 s current baseline, not from a zero-penalty starting point — the true v1.1 UX cost over current behavior is closer to +6–13 s.
False-premise questions come in two flavors
When authoring golden-dataset questions whose premise is wrong, the question can point at one of two genuinely different failure modes. Both are valid; they test different pipeline paths and should be labeled distinctly so the evaluator routes correctly.
Flavor A — pure refusal. The premise is not addressed anywhere in
the corpus. Example: "How do I configure Claude API rate limits in
Kubernetes?" K8s has no such concept. Schema: category: "out_of_scope",
expected_sources: [], source_snippets: []. The evaluator's
grounded_refusal metric expects the answer to contain a refusal
phrase ("does not contain", "no information") AND cite zero sources.
Tests the pipeline path where retrieval correctly returns nothing
useful and the agent correctly declines.
Flavor B — documented negative. The corpus contains an explicit
negative answer. Example: "How do I configure NetworkPolicy to enforce
mTLS?" The K8s NetworkPolicy docs have a "What you can't do with
network policies" section that explicitly says "Anything TLS related
(use a service mesh or ingress controller for this)". Schema:
category: "retrieval", question_type: "false_premise",
expected_sources: [<the negative-answer page>], source_snippets: [<the verbatim negative statement>]. The evaluator expects the agent
to retrieve the page, find the negative statement, and answer
negatively with a citation. Tests the stricter path where the corpus
genuinely contains the answer and the agent must not hallucinate a
contradictory capability.
Why both matter for the honest-evaluation brand. Grounded refusal
is not "refuse when retrieval is weak." It is "answer exactly what the
source says, including when the source says no." Flavor A tests the
first half (refuse when there is nothing to ground on); flavor B tests
the second half (report the documented negative instead of
confabulating a positive). The K8s golden dataset includes at least
one of each. The first K8s pilot (k8s_pilot_005, NetworkPolicy
mTLS) is flavor B. Flavor A is reserved for questions targeting
features that genuinely do not exist in the K8s corpus; at least one
such question is required in the full 25-question set.
Pilot_005 refusal-gate + agent-behavior measurement
The first K8s pilot run surfaced two distinct flavor-B failure modes
on k8s_pilot_005 (NetworkPolicy mTLS). Both are empirical, both
have specific numbers, and both are logged in
results/k8s_pilot_threshold_0.02.json and
results/k8s_pilot_threshold_0.015.json.
Failure mode 1 — threshold calibration (at 0.02). The
SearchTool.execute() refusal gate fired with max_score=0.01639 —
exactly 1/(60+1), the rank-1 RRF score from a single fusion system.
BM25 hit "NetworkPolicy" at rank 1; the dense encoder contributed
nothing, because "Anything TLS related (use a service mesh or ingress
controller for this)" is a single negative sentence, not a conceptual
topic the page is semantically "about." Hybrid fusion inherited only
the BM25 rank-1 score. At threshold 0.02 (the FastAPI working value),
the gate refused before the agent saw any chunks. Retrieval P@5 and
R@5 both 0.00; answer is a generic refusal.
Failure mode 2 — agent behavior on documented negative (at 0.015).
With the threshold dropped just below the measured max score
(0.015 < 0.01639), retrieval is perfect: P@5 1.00, R@5 1.00, all
five top chunks from k8s_network_policies.md. But the agent still
produces a flavor-A-style refusal: "The Kubernetes documentation
does not provide specific instructions on configuring a NetworkPolicy
to enforce mutual TLS..." The "Anything TLS related" sentence is in
the retrieved chunks — the agent simply treats the absence of
positive instructions as grounds for refusal, rather than reading the
explicit negative sentence and citing it as the answer. KHR 0.67: the
service mesh and ingress controller keywords (the documented
alternatives the page points to) are missing from the answer.
Implication. The flavor-B mechanism requires more than threshold tuning. Fixing the gate is necessary but not sufficient. The system prompt needs a flavor-B clause (e.g., "if the documentation explicitly says a feature does not exist or is not supported, report that with citation — do not treat it as unanswerable"), or the K8s golden dataset's flavor-B questions must use phrasing the current prompt can route correctly. The 0.30 placeholder value from the design doc was based on "prefer conservative" intuition without empirical grounding — the measured working range for K8s pilot retrieval is lower by more than an order of magnitude than that intuition, and even at the working threshold the prompt layer is the blocker.
What this measurement is. A pilot smoke-test result, not a benchmark claim. Aggregates at 0.02: P@5 0.63, R@5 0.83, KHR 0.69. Aggregates at 0.015: P@5 0.80, R@5 1.00, KHR 0.75. Five of six pilots produce substantively correct answers on K8s content under the working threshold — evidence the retrieval stack generalizes to K8s. The pilot's job was schema validation + calibration evidence, not launch metrics. Launch metrics come from the 25-question K8s golden set with tuned threshold and (likely) a revised system prompt, sequenced after this pilot.
Evaluation-layer multi-corpus support lagged the serving-layer refactor
The Tasks 1–8 multi-corpus refactor wired corpora through
app.state.corpus_map and the /ask serving route. scripts/evaluate.py
was not touched and remained single-corpus — it read
config.rag.store_path and config.evaluation.golden_dataset
directly, with no awareness of the corpora dict. This was an
accurate scoping of the refactor (serving-layer, not eval-layer) but
the gap was not surfaced in the original task list.
The K8s pilot commit adds --corpus <name> to scripts/evaluate.py,
routing through config.corpora[name] for store_path,
refusal_threshold, and a new optional golden_dataset field on
CorpusConfig. Without --corpus, the legacy single-store path is
preserved for backward compatibility with make evaluate-fast and
any existing invocations.
CorpusConfig.golden_dataset is str | None = None — optional
rather than required — because two legitimate states exist: corpus
has a golden dataset (FastAPI, K8s post-authoring), and corpus has no
golden dataset yet (any corpus during bring-up). The CLI errors
cleanly with "corpus '' has no golden_dataset configured"
when the field is None, rather than requiring all corpora to ship
with datasets.
Deferred: path-preserving ingestion
scripts/ingest.py uses doc_path.glob("*.md") (non-recursive) and
stores the bare filename as the chunk's source field. This forces
a flat-namespace convention: FastAPI ships as fastapi_*.md, K8s
ships as k8s_*.md, and golden dataset expected_sources are
filename stems. The path-preserving alternative (recursive rglob
plus relative-path source IDs, e.g., concepts/workloads/pods) was
evaluated during the K8s pilot planning and explicitly deferred. The
root-cause refactor would have required FastAPI re-ingestion and a
rewrite of the FastAPI golden dataset's expected_sources — trading
certain regression risk on a green baseline (288 tests, citation
accuracy 1.00 on API providers) for speculative legibility benefit
on K8s authoring.
The source_pages field on GoldenQuestion preserves the
human-readable path anchor separately from the machine identifier,
so the deferral does not lose information. Authors see both
expected_sources: ["k8s_pods.md"] (what the evaluator matches on)
and source_pages: ["concepts/workloads/pods"] (where the content
came from on kubernetes.io) in the same question record.
Pattern marker, not a promise. This is the second visa-timeline
deferral of a root-cause refactor in favor of a minimal-blast-radius
fix; the first was the Mar 25 → Apr 12 P@5 slide bisection. Both
deferrals were deliberate, not forgetting. Not scheduled until
post-launch; marker only. Post-launch scope: modify ingest.py to
rglob + relative-path source IDs, re-ingest FastAPI, rewrite both
golden datasets' expected_sources to path-style. Estimated 3h.
K8s refusal_threshold empirical calibration — 0.02 → 0.015
Change. configs/default.yaml, corpora.k8s.refusal_threshold:
0.02 → 0.015. Single-line config change, pilot-corpus only.
FastAPI threshold unchanged.
Empirical evidence. Diagnostic instrumentation of k8s_pilot_005
("How do I configure a Kubernetes NetworkPolicy to enforce mutual
TLS (mTLS) between Pods in the same namespace?") captured the
retrieval gate firing at max_score = 0.01639344262295082 — exactly
1 / (60 + 1), the algebraic floor for a single rank-1 BM25 hit
under RRF with rrf_k = 60, dense contribution zero. At
refusal_threshold = 0.02, pilot_005 tripped the gate and short-
circuited before retrieval chunks reached the agent. At
refusal_threshold = 0.015 (one tick below the measured floor), the
gate releases and retrieval proceeds. The 0.015 value is not a
tuning guess — it is the nearest round-number floor below the
observed gate-fire value for the single worst pilot in the set.
Validation. results/k8s_preedit.json captures the full 6-pilot
run at 0.015. Aggregate: P@5 0.80, R@5 1.00, KHR 0.78, mean
tool_calls_made 1.167. All six questions receive retrieval; no
gate-fire short-circuits. pilot_005 still refuses as a separate
downstream issue (see next entry when the counterfactual-query fix
lands); that is not a threshold problem.
Scope of this commit. K8s only. FastAPI refusal_threshold
(0.02) is not affected and FastAPI baseline is not re-measured.
Launch-intent 0.30 placeholder for K8s remains as a comment
marker; the full threshold sweep against the 25-question golden set
replaces 0.015 with a properly-tuned value in a later commit. 0.015
is the pilot-floor safety value, not the production-target value.
Why this is a separate commit from the prompt revision. The
threshold calibration is empirically grounded on its own — it
removes the 0.01639 gate-fire blocker, which is the precondition for
any downstream evaluation of pilot_005's actual agent behavior. The
prompt revision addresses a different failure mode surfaced once
the gate releases (agent search strategy is monotone positive-
framing). Two independent changes must not entangle in one commit;
if the prompt revision fails its regression gate and is reverted,
the threshold calibration should stand on its own empirical merit.
Feedback memory feedback_fix_before_sweep.md applies recursively:
fix measurement-affecting bugs at every layer before combining
fixes into single experiments.
Prep for counterfactual-query prompt regression — pin, wire, tolerances
Three sub-changes bundled as one prep commit, each small and in service of making the downstream regression measurement valid.
1. OpenAI model pin. agent_bench/core/provider.py:208 changes
self.model = "gpt-4o-mini" → self.model = "gpt-4o-mini-2024-07-18".
The unpinned alias is a known drift vector — the Mar 25 → Apr 12 P@5
slide bisection is an already-open parallel track item traceable to
silent alias migration. A regression run that uses the alias across
pre-edit and post-edit phases conflates prompt-clause effect with
model drift, even within a single session if the alias happens to
roll between runs. Pinning the dated snapshot removes the variable.
Pricing dict in configs/default.yaml gets a matching
gpt-4o-mini-2024-07-18 entry so the cost-lookup at
provider.py:209 still resolves. Tests that pin the model string
live in mock response payloads (not outgoing assertions) and the
langchain baseline (separate code path) — neither affected.
2. FastAPI multi-corpus eval wiring. configs/default.yaml
adds corpora.fastapi.golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json.
The production serving path at routes.py:105-120 _resolve_system_prompt
already routes /ask and /ask/stream through format_system_prompt(label)
from core/prompts.py — the app.state.system_prompt legacy fallback
(serving/app.py:276) is effectively dead code given the shipped multi-corpus
config. The only remaining caller of task.system_prompt is the
scripts/evaluate.py legacy branch used by make evaluate-fast. Adding
the missing golden_dataset field makes --corpus fastapi work so the
regression gate can measure the actual production prompt path, not the
legacy eval-scaffolding prompt. Purely additive; zero blast radius on
serving (serving doesn't read golden_dataset).
3. Pre-committed four-metric tolerances. Written down now, before the post-edit runs, so the pass/fail call on the counterfactual-query prompt clause is not a judgment under confirmation-bias pressure. Applied identically to FastAPI and K8s:
| Metric | Pass criterion |
|---|---|
| P@5 | post-edit ≥ pre-edit − 0.02 |
| R@5 | post-edit ≥ pre-edit − 0.02 |
| Citation accuracy | post-edit ≥ pre-edit (hard gate — any drop blocks commit) |
Mean tool_calls_made |
post-edit ≤ pre-edit + 0.30 |
| Individual question cap | no question that used fewer than max_iterations=3 iterations pre-edit may hit the cap post-edit |
pilot_005 strict flip criterion (K8s-only):
keyword_hit_rate ≥ 0.60against golden keywords["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]- Answer cites
k8s_network_policies.md - Answer contains "service mesh" OR "ingress controller" (the concrete documented-negative evidence the pre-edit refusal lacked)
- Answer does NOT begin with refusal phrasing ("The ... documentation does not provide", "I cannot answer")
Baseline reference: K8s pre-edit numbers from results/k8s_preedit.json
at commit 125dac0 — P@5 0.80, R@5 1.00, citation 1.00 (all 6),
mean tool_calls 1.167. FastAPI pre-edit reference established by
results/fastapi_preedit.json in the next step of this session,
same pinned ID, same refusal threshold (0.02).
Rationale for bundling. All three sub-changes answer "what must be true before the regression measurement is valid" — drift control, evaluation path, decision criteria. Splitting into three commits would add noise without adding signal. None of them change the prompt template itself; the prompt edit is the NEXT commit and is the sole experimental variable the regression measures.
Fix 1 (prompt-level counterfactual clause) attempted and reverted
Outcome. K8s regression clean on every metric (P@5, R@5, KHR,
citation, mean tool_calls all within tolerance or unchanged); K8s
pilot_005 flipped from refusal to documented-negative-with-citation
as designed (KHR 0.67 → 1.00, answer contains both "service mesh"
and "ingress controller", cites k8s_network_policies.md).
FastAPI regression failed on the iteration-inflation tolerance:
mean tool_calls_made 1.111 → 1.556 (delta +0.444, gate +0.30),
and two retrieval questions (q024, q025) were pushed from 1 pre-edit
tool call to 3 post-edit tool calls (hitting max_iterations=3
cap), violating the pre-committed "no new cap-hits from sub-cap
baseline" criterion.
Correctness metrics on FastAPI all held. Citation accuracy stayed at 1.000 / 1.000 across all 27 questions. P@5 delta −0.007, R@5 delta 0.000, KHR delta +0.006. The failure is purely process inflation, not output regression. q024 and q025 produce identical P@5/R@5/KHR/citation numbers pre and post despite the cap-hit — the orchestrator's "max iterations hit → one final complete() without tools" path happened to keep answers correct, but that is observation, not structural protection.
Failure mode. The clause's trigger condition — "your first search returned documentation about the subject of the question without addressing the specific capability or feature the user is asking about" — relies on subjective LLM judgment about whether retrieved content "addresses" a capability. The judgment is fuzzy on compound multi-topic questions where the first search returns partial-topic coverage. q024 asks about "Docker + Gunicorn workers
- health checks + Pydantic Settings"; first search returns Docker content, LLM reads "documentation about the subject without addressing the specific capability," fires the follow-up with negative framing, gets nothing useful, does a third normal search to cover the remaining topics, hits the cap. Same pattern on q025. Over-firing on this class of question is an inherent fragility of prompt-level LLM-judged triggers; a wording refinement might narrow the misfire rate but cannot eliminate it as long as the judgment itself is fuzzy.
q023 vs q024/q025 asymmetry is a useful signal for Fix 2. q023
is a pre-existing 3-tool-call compound question ("custom error
handling + CORS middleware + structured testing with dependency
overrides"). Under the prompt clause, q023 was unchanged — the
clause did not fire on it — while q024 and q025, structurally
similar compound questions, were pushed into 3-tool-call cap-hit.
The difference is not in question structure but in how the LLM
interpreted the first-search return for each. That asymmetry is
the precise reason a deterministic trigger is the right next step:
any Fix 2 / Fix 3 candidate should be unit-testable against
(pilot_005, q023, q024, q025) — the right fix must fire on
pilot_005 and behave predictably on all three compound questions
(either fire on all of them or none of them, but not pick them
selectively by LLM whim).
Gate discipline honored. The pre-committed FastAPI tolerances
fired for exactly the reason the pre-commitment was designed:
catching process-metric regressions before they ship. Tolerance-
relaxation post-hoc would burn the session's strongest discipline
artifact (pre-committed-tolerances + honored-gate) for marginal
ship-this-approach EV. The narrow pilot_005 finding does not
evaporate with the revert — chunk 63 (d0806d5da91d6026) is real,
the negative-framing retrieval is reproducible, and Fix 2 will
surface the documented negative the same way via a deterministic
path.
Fix 2 deferred to a later session. Deterministic query
expansion at the SearchTool layer: when a search_documents
call returns no chunk containing a direct answer string, issue a
second internal search with negative-framing keywords and merge
results before returning to the orchestrator. Offline-testable,
corpus-agnostic, no LLM judgment required, no iteration-budget
impact (the double-search happens inside a single tool call, not
across iterations). Unit-testable against the
(pilot_005, q023, q024, q025) asymmetry as an acceptance fixture.
Evidence retained. Four result JSONs in results/ document the
regression measurement at the pinned gpt-4o-mini-2024-07-18
snapshot in this session:
fastapi_preedit.json— 27 questions, HEAD prompt, 0.02 thresholdfastapi_postedit.json— 27 questions, clause prompt, 0.02 threshold (gate-failing run)k8s_preedit_pinned.json— 6 pilots, HEAD prompt, 0.015 thresholdk8s_postedit.json— 6 pilots, clause prompt, 0.015 threshold (gate-passing run, pilot_005 strict flip confirmed)
The previously-committed results/k8s_preedit.json (from 125dac0)
is also a valid K8s-pinned measurement at the session-equivalent
snapshot and remains the canonical threshold-commit evidence.
Held DECISIONS.md drafts stay held. The counterfactual-query
finding draft (to be updated when Fix 2 lands) and the threshold-
calibration entry already committed at 125dac0 are both correct
in scope. The narrowed serving-migration deferral entry (tied to
any external reference to the counterfactual-query fix) also stays
deferred until Fix 2 lands, since the production/eval-harness
prompt divergence is unchanged by this revert.
Fix 2 pre-committed regression gate — SearchTool deterministic query expansion
Pre-committed BEFORE post-edit runs (same discipline pattern that caught Fix 1's iteration inflation cleanly).
Mechanism under test. agent_bench/tools/search.py
SearchTool.execute gains a deterministic two-query retrieval
path. When the primary retrieval passes the refusal gate, a
secondary retrieval is issued against an expanded query
(original_query + " not supported limitations cannot"), and the
final context returned to the LLM is primary_top_3 ++ secondary_top_5 deduplicated by chunk.id. Both retrievals run
inside a single SearchTool.execute call — from the LLM's
perspective, the tool schema, name, parameters, and return shape
are unchanged, and the iteration budget is untouched.
Why this is architecturally different from Fix 1. Fix 1 placed
a behavioral clause in the system prompt that told the agent to
issue follow-up searches itself. The trigger was an LLM judgment
("did the first search return content addressing the specific
capability?") and the follow-up was a separate tool call, so it
counted against max_iterations. Over-firing on compound questions
inflated iteration counts and pushed q024/q025 to the cap. Fix 2
replaces this with a deterministic trigger (primary passes gate),
a fixed expansion suffix, and a merge that happens entirely inside
one tool call. No LLM judgment; no iteration change; corpus-
agnostic.
Suffix choice. " not supported limitations cannot". Keyword-
dense, ungrammatical on purpose — the suffix exists to shift BM25
and embedding mass toward "what you cannot do" / "limitations"
sections, not to read well. The ungrammatical form is also a self-
documenting signal in retrieval logs: anyone reading a query trace
sees the suffix and immediately knows it is a synthetic expansion,
not user input. A one-line comment in search.py preserves the
rationale for future readers.
Merge choice. primary_top_3 + secondary_top_5 deduped by
chunk.id, producing 5–8 unique chunks per call. Rationale: top-5
primary would make the expansion redundant on high-overlap queries
(defeating the mechanism), while primary-top-3 guarantees the
expansion always contributes to the final context window. Probe
data (/tmp/probe_fix2_v2.py, throwaway) confirms this merge
strategy surfaces pilot_005's target chunk
(d0806d5da91d6026, chunk_index 63, "Anything TLS related ... use
a service mesh or ingress controller for this") at position 6–8 in
the merged list.
Opt-in flag, defaulting ON. SearchTool accepts
negative_framing_expansion: bool = True. Default is the shipping
configuration because the regression gate must measure the shipping
behavior, not the no-op path. A False default would mean the gate
validates an unused parameter, and a subsequent commit flipping the
default would have no regression evidence. Kill switch is preserved
via explicit False at construction if a future regression
requires an A/B comparison.
Baseline reuse. The Fix 1 session's pre-edit JSONs
(results/fastapi_preedit.json, results/k8s_preedit_pinned.json,
both committed at 213da36) were measured under the currently-
committed state of the repo: pinned gpt-4o-mini-2024-07-18, K8s
threshold 0.015, FastAPI threshold 0.02, HEAD prompts.py with no
clause, HEAD search.py with no expansion. The working tree
verification confirms this state is unchanged. These JSONs are
therefore reused as the Fix 2 pre-edit baseline and do not need to
be re-measured. Only post-edit runs are required for the Fix 2
regression (~$0.02 saved).
Pre-committed tolerances.
| Metric | Pass criterion |
|---|---|
| P@5 | post-edit ≥ pre-edit − 0.02 |
| R@5 | post-edit ≥ pre-edit − 0.02 |
| Citation accuracy | post-edit ≥ pre-edit (hard gate — any drop blocks commit) |
Mean tool_calls_made |
post-edit ≤ pre-edit + 0.05 (design-correctness gate — see note) |
| Individual cap-hit | no question that used fewer than max_iterations=3 iterations pre-edit may hit the cap post-edit |
Note on the tool_calls gate. ≤ +0.05 is a design-correctness
gate, not a performance gate. Fix 2's invariant is that both
retrievals happen inside one SearchTool.execute call, so the
LLM's iteration count is unchanged by construction. Any non-trivial
movement in mean tool_calls_made indicates the design invariant
is broken — e.g., expansion accidentally exposed as a separate
tool, or the LLM observing two-call behavior and adapting its
strategy. The gate fires on design violation, not on performance
regression. The 0.05 absolute threshold absorbs legitimate run-to-
run variance from non-determinism in the LLM even at temperature
0, without absorbing real iteration-count movement.
pilot_005 strict flip criterion (K8s-only, unchanged from Fix 1 gate):
keyword_hit_rate ≥ 0.60against golden keywords["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]- Answer cites
k8s_network_policies.md - Answer contains "service mesh" OR "ingress controller"
- Answer does NOT begin with refusal phrasing
Baseline reference for the gate.
| Corpus | Pre-edit source | P@5 | R@5 | Citation | Mean tool_calls |
|---|---|---|---|---|---|
| FastAPI (27) | results/fastapi_preedit.json @ 213da36 |
0.585 | 0.679 | 1.000 | 1.111 |
| K8s (6 pilots) | results/k8s_preedit_pinned.json @ 213da36 |
0.800 | 1.000 | 1.000 | 1.167 |
Post-edit filenames (to be produced).
results/fastapi_postedit_fix2.jsonresults/k8s_postedit_fix2.json
If the gate passes: commit Fix 2 with search.py change, unit
tests (including the tool-spec snapshot test), the two post-edit
result JSONs, and this DECISIONS.md entry extended with the
regression outcome.
If the gate fires: revert, document the failure mode, surface the specific criterion that fired. No tolerance relaxation — same discipline pattern as Fix 1 revert.
Fix 2 outcome — mechanism works, response-style criterion fired, reverted
Regression runs produced. Two post-edit runs on K8s (FastAPI not run — K8s findings gated the decision before API spend on the broader set):
| Run | Merge rule | File | Purpose |
|---|---|---|---|
| Fix 2 v1 | primary[:3] + secondary[:5] |
results/k8s_postedit_fix2.json |
Initial implementation |
| Fix 2 v2 | primary[:5] + secondary[:5] |
results/k8s_postedit_fix2_merge_v2.json |
Path A refinement after v1 failed P@5 on a metric-definition mismatch |
v1 findings. Aggregate: P@5 0.800 → 0.767 (Δ −0.033, FAILED
the P@5 ≥ −0.02 tolerance). The failure traced to a merge-rule /
metric-semantics interaction: retrieval_precision_at_k computes
precision on retrieved_sources[:5], and with primary[:3] + secondary[:5] the first 5 entries were primary_top_3 + secondary_top_2. For pilot_005, secondary[1] was
k8s_pods.md (chunk_index 40, surfaced because the reranker
matched its "localhost communication" content against the expanded
query). That single off-source chunk in position 5 dropped P@5
from 1.00 to 0.80 for pilot_005 and similarly for pilot_006.
Iteration invariant held (tool_calls 1.167 → 1.167). Citation
accuracy held (1.000 → 1.000). Target chunk
(d0806d5da91d6026, "Anything TLS related") reached the LLM
context for pilot_005 at merged position 7.
Path A refinement (merge v2). Change primary[:3] + secondary[:5] → primary[:5] + secondary[:5]. Rationale:
primary_top_5 is preserved in positions 1–5 by construction, so
P@5 computed on ranked_sources[:5] is unchanged from the
no-expansion baseline. Expansion chunks land in positions 6–10.
Target chunk still reaches LLM context (position 9 for pilot_005).
This is an implementation refinement, not a tolerance
relaxation — the pre-committed gate thresholds stand; only the
merge rule was adjusted to respect the metric's window semantics.
v2 findings — perfect metric preservation, but strict-flip fails on response style.
Aggregate:
| Metric | Pre-edit | Fix 2 v2 | Delta |
|---|---|---|---|
| P@5 | 0.800 | 0.800 | 0.000 |
| R@5 | 1.000 | 1.000 | 0.000 |
| KHR | 0.806 | 0.806 | 0.000 |
| Citation accuracy | 1.000 | 1.000 | 0.000 |
Mean tool_calls_made |
1.167 | 1.167 | 0.000 |
Every aggregate metric literally unchanged. Per-question deltas: zero on every metric, every question. The design invariant (iteration budget unchanged, tool schema unchanged, refusal gate behavior unchanged) holds perfectly.
But pilot_005 strict flip fails on the refusal-phrasing criterion. Post-edit answer:
"The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. For mTLS, it is generally recommended to use a service mesh or other proxy solutions, as NetworkPolicy alone does not handle TLS configurations directly [source: k8s_network_policies.md]."
The answer substantively contains the documented negative with citation. But it opens with "The Kubernetes documentation does not provide specific instructions..." — the exact refusal- phrasing opener the strict-flip criterion was pre-committed to reject. The criterion exists because the brand is honest evaluation: an answer that opens apologizing that the documentation "does not provide specific instructions" reads, to a technical reviewer, like the system failed to find the answer and is papering over the gap, even though the facts and citation are present. The criterion fired as designed.
Compare to Fix 1 post-edit answer (from 213da36 evidence):
"Kubernetes NetworkPolicy does not support enforcing mutual TLS (mTLS) directly. The documentation states that anything TLS related should be handled using a service mesh or ingress controller, rather than through NetworkPolicy [source: k8s_network_policies.md]."
Fix 1's answer asserts a fact about NetworkPolicy ("does not support"); Fix 2's answer asserts a fact about the documentation ("does not provide instructions"). The first forecloses the capability; the second leaves open whether the capability exists somewhere the system didn't see. That distinction is load-bearing for any grounded-refusal narrative, and it separates a system that handles documented negatives crisply from one that hedges around them.
Diagnosis. Fix 2's mechanism successfully gets the target chunk into the LLM's context window — the retrieval side of the problem is solved. What Fix 2 cannot provide is explicit guidance on how to phrase the documented negative once the chunk is present. Fix 1's prompt clause was doing that guidance work; removing the clause and relying on the LLM's unaided response style produces a hedging answer because the LLM, seeing both NetworkPolicy-spec content and a TLS limitation bullet, defaults to contextual hedging rather than crisp assertion.
Fix 2 is therefore not an alternative to Fix 1's prompt clause — it is a prerequisite. Fix 2 guarantees the chunk reaches context; a future "Fix 2 + targeted prompt clause" stack could resolve both the retrieval gap and the response-style gap without Fix 1's over-firing problem, because the clause would no longer need to direct the agent to do a follow-up search (Fix 2 handled that). The over-firing on compound questions that broke Fix 1 was caused by the agent deciding to do extra search iterations under LLM judgment; if the expansion already happened deterministically inside the first tool call, the clause has less work to do and may not trigger the second-LLM-call pattern at all. Speculative and not for this session. Future work item.
Gate verdict: failed on pilot_005 strict flip criterion. Reverting, same Fix-1 pattern.
What this commit contains.
agent_bench/tools/search.pyreverted to HEAD (no Fix 2 code changes)tests/test_tools.pyretains theMockChunk.idhygiene fix (the realChunkclass hasid; mock should match the real API for future test authors)tests/test_tools.pyaddsTestSearchToolSpecSnapshot: a general-purpose guard that freezesSearchTool's LLM-facing contract (name, description, parameters). The lesson from Fix 2 is that any future refactor exposing internal SearchTool state to the LLM would break iteration-budget invariants — the snapshot test catches that at test time, independent of whether Fix 2 lands.- Two regression evidence JSONs:
results/k8s_postedit_fix2.json(v1, the P@5 failure) andresults/k8s_postedit_fix2_merge_v2.json(v2, the strict-flip failure). Retained as the measurement trail behind the revert decision. - This DECISIONS.md entry (pre-committed gate + outcome + revert narrative).
What this commit does NOT contain. No changes to
agent_bench/tools/search.py, agent_bench/core/prompts.py, or
configs/default.yaml. Both Fix 1 (prompt clause) and Fix 2
(SearchTool expansion) have been attempted and reverted this
session. Three commits of progress nonetheless: 125dac0
(threshold calibration, empirical), 5c1f49f (prep bundle: model
pin + fastapi wire + Fix 1 pre-committed tolerances), 213da36
(Fix 1 revert narrative). The threshold calibration and model pin
are real, shipped, measurement-grounded infrastructure changes.
The two fix attempts are documented learning that shapes the
future direction.
grounded_refusal metric reads answer text, not retrieved sources — 2026-04-14
Context. Week 1 step 5 authoring (25-question K8s golden set). Two
flavor-A out-of-scope questions (k8s_004 Jaeger sidecar, k8s_024
Envoy xDS ADS) surfaced a pre-existing bug in the
grounded_refusal metric during the functional check.
Bug 1 — wrong signal. The metric's docstring said it checks
whether the answer correctly refuses AND cites no sources, but the
implementation was checking len(response_sources) == 0 where
response_sources is the retrieved-sources list. Real agents
retrieve candidates on any non-trivial OOS query (the grounded-refusal
gate at tool level only catches the thinnest queries), inspect the
candidates, find nothing relevant, and refuse in the answer text
without citing anything. Checking retrieval emptiness flagged those
correct refusals as failures. Fix: inspect the answer text for
[source: X.md] citations via regex; drop the response_sources
parameter from the signature entirely.
This was a silent false negative on all 5 fastapi out-of-scope
questions (q008–q010, q026–q027) which all correctly refuse
but were being marked grounded_refusal=False. Aggregate
refusal_rate in report.py shifts by the resulting 5-question
delta; any historical comparison to pre-fix fastapi numbers needs
to acknowledge this.
Bug 2 — metric coverage gap surfaced during 25-question authoring.
grounded_refusal_rate recognized "does not contain information"
phrasing (in refusal_phrases list) but missed "not in the
{corpus_label} documentation" phrasing — the exact shape taught by
the system prompt at core/prompts.py:17-18. The LLM produced the
canonical form on some questions and the phrase-list form on others;
the metric inflation/deflation was non-deterministic. Fix: narrow
regex \bnot in the\b[^.]{0,60}\bdocumentation\b added alongside
phrase-list matching.
Rejected alternative. Substring "not in the" would produce
false positives on valid-answer phrasing — "the rate limit is not in
the same scope as the request timeout", "the flag is not in the 1.28
release; it landed in 1.29", "this value is not in the default
range" — all of which are legitimate retrieval answers with
conditional or scope-limiting language, not refusals. Honest
evaluation cannot afford a metric that silently counts these as
grounded refusals.
Tests. Two unit tests pin both directions:
test_canonical_refusal_phrasing_recognized covers the positive
case ("The answer is not in the Kubernetes documentation"), and
test_not_in_the_is_not_substring_refusal covers the negative case
("The rate limit is not in the same scope as the request timeout").
The negative test is the load-bearing one — without it, a future
refactor could silently widen the matcher back to substring and pass
all existing tests. The negative test pins design intent.
Scope bound. This is a metric correctness fix, not a threshold
change. The 0.015 refusal-gate threshold (calibrated in 125dac0
against the 6-question pilot) is unchanged by this commit. Whether
the corrected metric shifts the optimal threshold against the full
25-question set is a question for the threshold-sweep session, not
this authoring session.
Parallel tracks / deferred items — 2026-04-14
Tracked list of work items that are deferred to parallel sessions. Each item has a reason for deferral and a rough scope boundary so the session that picks it up has the context to pre-commit tolerances and decision criteria before measuring.
routes.py:552audit-logger semantics unification. The serving layer's audit record field still uses the pre-fixgrounded_refusal = not bool(sources)expression, which disagrees with the evaluation metric's answer-text-based definition. Not surfaced to the dashboard (audit log only), but external reviewers who reference audit records for runtime verification would see a different definition than the benchmark claims. Fix: callgrounded_refusal(answer, category)frommetrics.pydirectly. When this lands, the "grounded_refusal metric" DECISIONS.md entry above should get a one-line addendum noting the unification.Full 25Q threshold sweep → production-target
refusal_thresholdfor K8s. The 25Q set exists, the metric is correct. Sweep against the full set, compare to pilot-floor 0.015, pick the production-target value, updateconfigs/default.yamlplaceholder comment. Pre-commit before measuring: sweep range, decision criteria, tolerances. Do not entangle with flavor-B response-style work below — those are independent axes.Flavor-B response-style class (pilot_005 + k8s_022). Two independent reproductions of "LLM refuses when documented negative is in retrieved context". Retrieval is healthy on both; the gap is prompting. Future session: Fix 2 (counterfactual-query expansion in
SearchTool) + targeted prompt clause stacked — previously speculative in the Fix 2 revert entry, now addresses a documented reproducible class. Two reproductions, not one-off.Serving-migration deferral. Tied to external references to the counterfactual-query fix. Unchanged from prior sessions.
agent-bench→refusal-benchrename — CLOSED 2026-04-14. Decision: keepagent-bench, reframe via tagline. The original concern was name collision with AgentBench (Liu et al., ICLR 2024,1000 citations). Due-diligence at launch time: the name is350 internal references across ~60 files, two external account renames, one HF Space URL break with no redirect) for a naming-precision benefit that isn't supported by the actual scope — the benchmark measures retrieval, grounding, multi-hop, citation accuracy, and refusal as seven axes, not refusal alone. Tagline reframe captures the honest-evaluation positioning without the rename cost:agent-bench(hyphenated) vs.AgentBench(camelcase), which are distinct identifiers across GitHub, arXiv, and PyPI. The two projects target different audiences (LLM-as-agent capability vs. RAG+refusal benchmark) and any reviewer reaching the repo via LinkedIn or CV sees the scope in the README within seconds. Rename cost is substantial ("A RAG benchmark built from primitives, with honest evaluation of retrieval, refusal, and grounded citation." HF Space rename (
Nomearod/agentbench→Nomearod/agent-benchfor GitHub-name consistency) is a separate, smaller follow-up deferred approximately one week. Reason: several job applications submitted the preceding week reference the current HF URL (nomearod-agentbench.hf.space); renaming the Space now would break those inbound links with no HF-side redirect. The rename absorbs cleanly once the application wave lands and the reference window expires. Until then the README, dashboard, and DECISIONS.md continue to reference the currentagentbenchURL; launch-adjacent work (Post #1, screenshots, cold-start measure) uses the current URL and will be updated in a single small follow-up commit when the rename happens.OpenAI snapshot drift bisection. Mar 25 → Apr 12 P@5 slide; the model pin at
5c1f49f(gpt-4o-mini-2024-07-18) removed the ongoing drift risk, so any future measurement is apples-to- apples. The original bisection is still unresolved but cheap at this point — tractable whenever there is session capacity, low urgency because the pin protects forward runs.Fix 2 revert commit SHA missing from the Fix 2 outcome entry. The "Fix 2 outcome — mechanism works, response-style criterion fired, reverted" DECISIONS.md entry describes the revert narratively but does not cite the revert commit's SHA (post-rewrite:
27c2e17—docs(eval): Fix 2 SearchTool query expansion — attempted and reverted). Add retroactive SHA reference in the next docs pass. Not urgent; noted so the narrative-without-SHA pattern does not spread to other entries. Lesson going forward: prefer explicit SHAs over positional references like "this commit" / "commit above" in DECISIONS.md entries — positional references do not survive history rewrites as robustly as SHA references do.
K8s refusal_threshold sweep against 25-question golden — 2026-04-14
Override notice. This sweep ran in the same session as the
25-question authoring + grounded_refusal metric fix (4454894),
after I explicitly flagged that the parallel-tracks guidance from
earlier in the session recommended waiting for a fresh session with
pre-commitment discipline. The user issued an explicit override:
"proceed on best-judgment sweep range and criteria" — logged here
for audit trail. The pre-commitment frame below was drafted BEFORE
running any sweep value, not after. The decision criteria were
locked before the first data point was observed, not retrofitted.
Sweep grid. 4 threshold values: 0.010, 0.015 (already
measured in .cache/eval_k8s_full25_postfix.json, the post-metric-
fix run from 4454894), 0.020, 0.025.
0.010: one tick below current calibration; sanity-check floor.0.015: current calibration (pilot-floor, one tick below pilot_005's 0.01639 max_score).0.020: matches legacy FastAPI threshold and the original provisional K8s default before the125dac0calibration.0.025: one tick above legacy; exploration of whether aggressive OOS short-circuiting is worth the correctness risk.
Decision criteria (pre-committed).
- OOS refusal must hold. Both
k8s_004(Jaeger) andk8s_024(Envoy xDS) must retaingrounded_refusal=Trueat the chosen threshold — whether the gate fires at the tool level or the LLM refuses after inspecting context doesn't matter, only that the metric reports True. - Retrieval recall must not degrade. Each retrieval-category
question's R@5 at the chosen threshold must be ≥ its R@5 at
0.015(the post-fix-25Q baseline) with a noise tolerance of at most ONE question dropping by at most 0.20. Two or more drops, or any drop > 0.20, disqualifies the value. - Citation accuracy must hold. All questions' citation_accuracy must be ≥ 0.95 at the chosen threshold. One question at 0.80 is noise-tolerated; two or more is a hard stop.
- k8s_022 (flavor-B) retrieval must remain at R@5=1.0. The gap is prompting-side, not retrieval-side; any threshold that breaks the already-working retrieval on flavor-B questions is a regression.
- Pick the highest threshold that satisfies 1–4. Rationale: a higher threshold short-circuits more OOS queries at the tool level, saving a retrieval round trip and an LLM call — this is a real latency and token-cost win when the correctness is held.
- Tie-break. If multiple values all satisfy 1–4, prefer the value closest to a clean round number (0.020 over 0.018) for documentation clarity.
- Floor. If no threshold > 0.015 satisfies 1–4, keep 0.015. No threshold < 0.015 will be chosen regardless — sub-0.015 is strictly less protective than the pilot-floor.
Scope bound. K8s only; FastAPI's refusal_threshold: 0.02 is
unchanged. The flavor-B response-style gap (parallel track #3) is
NOT a sweep variable — changing the threshold does not fix LLM
phrasing; that's the Fix 2 + prompt guidance stacked experiment
the parallel-tracks list already defers.
Measured results. All four runs use the post-metric-fix pipeline
(grounded_refusal metric from 4454894), deterministic mode,
gpt-4o-mini-2024-07-18, same retriever config.
| threshold | avg R@5 | OOS refusal | gate fired on | broken retrieval |
|---|---|---|---|---|
| 0.010 | 0.957 | 2/2 | — | — |
| 0.015 | 0.957 | 2/2 | — | — |
| 0.020 | 0.870 | 2/2 | k8s_006, k8s_007, k8s_024 | k8s_006, k8s_007 (R@5=0.00) |
| 0.025 | 0.913 | 2/2 | k8s_004, k8s_007, k8s_024 | k8s_007 (R@5=0.00) |
Structural finding: LLM query variance makes max_scores non-deterministic.
At 0.020, k8s_006 (ConfigMap, simple) gate-fired → empty retrieval →
R@5=0.00. At 0.025, k8s_006 did NOT gate-fire → 5 sources → R@5=1.00.
A higher threshold producing fewer gate-fires is physically impossible
if retrieval is deterministic — the SearchTool receives different
queries across runs because the orchestrator issues LLM-generated
queries, and the same question can produce different top-k max_scores
run-to-run. k8s_006's max_score for the query the LLM chose lives
somewhere around the 0.018–0.025 boundary; which side of any given
threshold it lands on depends on which query the LLM wrote.
This means any threshold above 0.015 is structurally fragile, not
merely "failed on this run." Even if a run at 0.018 passed, a future
run could gate-fire on k8s_006 or k8s_007 because the query is
non-reproducible. The production threshold needs to sit below all
legitimate simple-question max_scores with enough margin to absorb
LLM query variance.
Decision: keep refusal_threshold: 0.015.
0.010: meets all criteria, identical measured metrics to0.015(avg R@5=0.957, OOS refusal 2/2, no citation fails). Not chosen: lowering strictly weakens the gate's ability to catch low- confidence retrievals without improving any measured metric.0.015: chosen. Meets all criteria and is the highest value that does not degrade retrieval — which is the definition of the correct refusal-gate threshold. Preserving the gate's signal is the gate's purpose;0.015gives maximum gate strength without cost,0.010gives the same measurable behavior with less gate signal, so0.015dominates.0.020: breaks TWO retrieval questions (k8s_006,k8s_007); disqualified per criterion 2.0.025: breaks ONE retrieval question in this run (k8s_007) but the non-determinism finding means a future run could break more. Even ignoring non-determinism, still disqualified by the citation-accuracy-equivalent drop onk8s_007.
Corpus characteristic finding. The 0.020 default inherited from
FastAPI breaks on K8s because K8s retrieval score distributions are
lower for "easy" questions. k8s_006 ("What is a ConfigMap?") and
k8s_007 ("What does a Kubernetes Job do?") are both type: simple
with clean single-source expected answers — exactly the cases where
BM25+embedding scores should be highest. They land at max_scores in
the ~0.018 range, below the FastAPI-calibrated 0.020 default. This
is not an authoring bug — both questions retrieve their
expected_sources correctly when the gate doesn't fire. It's a
corpus characteristic: K8s documentation has more topic-overlap
across pages than FastAPI, diluting top-k concentration.
The 25-question set exposed this because the 6-question pilot had no simple questions with low max_scores — the pilot was drawn from retrieval-stressful areas (comparison, multi-hop, flavor-B). The 25-question authoring deliberately added simple questions to hit the CRAG distribution target (6 simple, 5–6 target), and those simple questions revealed the corpus-characteristic floor.
Config change. configs/default.yaml corpora.k8s.refusal_threshold
comment updated to reference this sweep. Value unchanged at 0.015.
Not in scope. (a) Adding retry-with-query-variance to the
SearchTool to reduce max_score variance — separate session, affects
other corpora. (b) Tuning FastAPI's threshold against its golden
set — the FastAPI default was empirically fine on its own 30Q set
and is not a documented regression. (c) Fixing the k8s_015
R@5=0.50 value observed across all threshold runs — pre-existing
authoring state from 4454894, tracked separately if it becomes
a concern on future runs.
Narrative summary. Session hypothesis: pilot_005 is a counterfactual-query-expansion problem. Session evidence: the hypothesis is correct on retrieval — the target chunk is reachable via negative-framing queries and Fix 2 surfaces it deterministically with zero iteration-budget impact. Session evidence also shows the hypothesis is incomplete — retrieval-only fixes cannot close the response-style gap, because the LLM under unaided prompting hedges when a documented negative is surrounded by unrelated topical content. A future session exploring Fix 2 + targeted prompt guidance stacked is the natural next experiment; this session's pilot-first discipline has been preserved against two distinct pre-committed gates, both firing for the reasons they were designed to catch.
Credential-exposure incident and history rewrite — 2026-04-14/15
Summary. During Week 1 work on the
feat/user-friendly-landing-page-live-dashboard branch, an
instruction.txt file containing plaintext OpenAI and Anthropic
API keys was accidentally committed at pre-rewrite SHA 2b3150f
(style: fix ruff lint — import sorting, line length) and removed
from the working tree in a later commit (pre-rewrite SHA 3a2c5ef,
security: remove instruction.txt containing plaintext credentials).
The removal did not clean git history — the keys remained accessible
via git show 2b3150f:instruction.txt in local history.
Discovery. The issue was discovered when GitHub push protection
rejected the first push of the branch to the origin remote,
flagging the credentials via its secret-scanning system. The branch
had never been pushed to any public remote prior to the rewrite;
the detection fired on the very first push attempt, which is the
correct moment for secret-scanning to act. Honest credit to the
tooling: GitHub's push protection did exactly what it was designed
to do, and the alternative failure mode (silent push of real
credentials to a public repo) did not occur.
Immediate actions, in order.
Key rotation. Rotated both OpenAI and Anthropic keys at the respective provider dashboards, revoking the exposed values immediately. Rotation was confirmed before any git operation ran — the reasoning was that the keys were exposed on the local disk regardless of whether they ever made it to a public remote, so the exposure window needed to be closed first.
Unauthorized-use check. Verified billing/usage dashboards on both OpenAI and Anthropic for the exposure window (from commit
2b3150flanding until rotation). No unauthorized activity observed on either account.Local
.envupdate and smoke test. Updated local.envwith the new keys. Verified both worked via minimal API calls that return only HTTP status codes (never the key values themselves):GET /v1/modelsfor OpenAI (200),POST /v1/messageswith a 1-token request for Anthropic (200). Total verification cost: <$0.0001.Repository backup. Before running any history-rewriting command, backed up the entire repository via
rsync -ato/Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>, excluding only.mypy_cacheand.cache(both derivative, regenerable, and explicitly.gitignore'd). The backup preserved.git/, all four worktree state files under.git/worktrees/, the.worktrees/checkouts themselves, and all tracked source files. The backup is the safety net if the rewrite had gone wrong in any way; this session never needed to consult it.History rewrite via
git filter-repo. Rangit filter-repo --path instruction.txt --invert-paths --forceon the main clone. The--forceflag was required because filter-repo's default safety check refuses to run on non-fresh clones; the backup step above mitigates the risk that this flag is usually guarding against. 186 commits were parsed and rewritten in ~2.4 seconds; filter-repo's internal repacking completed in an additional ~5 seconds. Theoriginandhfremotes were automatically unset by filter-repo as its standard safety behavior (and restored from a saved file before the push).Dropped empty commit. Pre-rewrite commit
3a2c5ef(which removedinstruction.txtfrom the working tree but did not clean history) became empty after filter-repo stripped the file from all prior commits and was dropped automatically. This is correct filter-repo behavior: the commit's only net effect was to remove a file that no longer exists in any predecessor, so post-rewrite it has no content change and is elided from the linear history. The total commit count went from 186 → 185. Pre-rewrite SHA3a2c5efmaps to00000...00000in.git/filter-repo/commit-map, indicating the drop. The dropped SHA was not referenced anywhere in DECISIONS.md, so the drop had zero audit-trail impact.Multi-layer verification sweep. Ran six checks across every location where the credentials could still be present: (a)
git log --all --full-history -- instruction.txtreturned empty; (b)git rev-list --all --objects | grep instruction.txtreturned 0 matches; (c)git reflog --allwas empty aftergit reflog expire --expire=now --all; (d)git fsck --unreachablereturned clean; (e)git stash listwas empty; (f) a precise key-value regex scan across all blobs in the rewritten object database (sk-[A-Za-z0-9]{30,},sk-ant-[A-Za-z0-9]{20,}, and env-var-assignment patterns) found 23 matches, all verified to be non-secret content — specifically: 15 historical README.md blobs containing the documentation placeholderANTHROPIC_API_KEY=sk-ant-...(with three literal dots), 7 historicaldocs/provider_comparison.mdblobs with the same documentation placeholder pattern, and 1tests/test_output_validator.pyblob containing test fixtures that intentionally use mock key-shaped strings to verify the output-validator's secret-redaction logic. The precise scan is a meaningful check: it demonstrates that the exposure was isolated toinstruction.txtand did not spread via copy-paste of the key values into other files before removal.Worktree walk. All four worktrees (
feat-infra-sprint,feature-grounded-refusal,langchain-baseline,security-hardening) were checked forinstruction.txthistory pollution and for uncommitted changes. All four were clean — no pollution in any branch's history (filter-repo operates on all refs in a shared.git/, so the worktrees were reached through the main clone's object database) and no local dirty state in any working tree. No worktree deletion or recreation was needed.DECISIONS.md SHA remap. The filter-repo operation rewrote every commit's SHA downstream of the first rewritten commit. This broke every explicit SHA reference in DECISIONS.md because those references pointed to pre-rewrite SHAs that no longer exist. The remap used
.git/filter-repo/commit-mapas the authoritative SHA-based mapping (not message-based pairing, which would have been vulnerable to duplicate-message ambiguity — 2 pairs of commits in the pre-rewrite history did in fact have identical messages, though neither was in the substitution set). Four unique old SHAs were remapped across 18 substitution sites:OLD (pre-rewrite) NEW (post-rewrite) Commit role bd2b913213da36Fix 1 counterfactual prompt clause revert b97f00f125dac0K8s refusal_threshold 0.02 → 0.015 calibration 77017db5c1f49fpin gpt-4o-mini snapshot + wire fastapi golden 526be184454894Week 1 step 5 — 25Q golden + grounded_refusal fix Every message matched exactly across the old→new pairing; no new SHA prefix collides with any old SHA prefix; post-remap grep confirmed zero remaining references to any old SHA.
Exposure scope assessment. The branch had never been pushed to any public remote prior to the rewrite. The credentials existed in:
- Local git history at
/Users/zenith/Desktop/agent-bench/.git/(cleaned) - Four worktree clones sharing the same
.git/(cleaned via the main repo) - The rsync backup at
/Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>(to be deleted after this commit and test suite confirm the rewrite is correct)
No external exposure via GitHub, HF Spaces, or any other shared system occurred. No cached CI artifacts contain the keys because CI only runs on pushed branches and this branch was never pushed. No forks or clones exist outside the local machine. GitHub's push-protection detection itself touched the key strings during the rejected push attempt, but GitHub's secret scanning is trusted infrastructure and the rejection is the good outcome, not an additional exposure event.
Why this entry exists. Credential hygiene failures are worth documenting, not hiding. A reviewer who reads this entry sees a developer who: made a mistake, caught it via automated tooling working as designed, rotated keys before touching git, rewrote history surgically with a backup as the safety net, verified the rewrite across six independent checks, and preserved audit-trail integrity through the SHA remap. The honest-evaluation brand extends to credential-handling incidents — the alternative of pretending this didn't happen, or silently unblocking the secret- scanning rejection to push exposed values to a public repo, would be a strictly worse outcome for both security posture and brand credibility.
Procedural lessons for DECISIONS.md going forward. Prefer
explicit commit SHAs over positional references like "this commit"
or "commit above" — positional references do not survive history
rewrites as robustly as explicit SHAs do. The "Fix 2 outcome"
entry above was identified during this incident as missing an
explicit SHA reference to the Fix 2 revert commit (post-rewrite
SHA 27c2e17); this is tracked as parallel-tracks item #7 for a
retroactive fix in the next docs pass.
Round 2 — Google API key format in a test fixture
After the round-1 rewrite was complete and the feature branch had
been pushed to origin for the first time, GitHub secret scanning
raised a second alert (alert #1, secret_type: google_api_key)
against tests/test_output_validator.py line 152 at pre-round-2
commit 8ebe3964af7d (security: fail-closed on secret extraction and env var leakage). The alert was on a test fixture inside a
@pytest.mark.parametrize list, structurally consistent with the
other fake fixtures in the same list (OpenAI sk-test123,
Anthropic sk-ant-xyz, AWS AKIAIOSFODNN7EXAMPLE). The Google
fixture, however, was 35 chars after the AIza prefix and matched
both GitHub's detection pattern and the output validator's own
detection regex exactly.
Disambiguation. Asked whether the string was a hand-typed fake or a real-leaked Google API key, the developer confirmed: (1) yes, a Google API key had been created at some point in a GCP or Google AI Studio context unrelated to this project, and (2) no, the string on line 152 was not recognizably hand-typed. Combined with the structural inconsistency against the other clearly-fake fixtures in the same parametrize list, the safe interpretation was to treat it as potentially real and rotate + rewrite rather than dismiss as false positive.
Actions, in order.
Google API key rotation. All Google API keys on the developer's GCP and Google AI Studio accounts rotated at the provider dashboards, regardless of which specific key matched line 152, because the specific match was not known with certainty. Rotation confirmed before any git operation.
Billing/activity check. Verified Google Cloud billing and API activity on every project for the window since commit
8ebe3964af7dlanded (2026-04-12 18:18). No unauthorized activity observed.Why the validator regex and GitHub's detector are identical. The output validator's regex at
agent_bench/security/output_validator.pyline 23 is\bAIza[0-9A-Za-z_\-]{35}\b— byte-for-byte identical to GitHub's secret-scanning Google API Key detection pattern. This means there is no static test fixture that satisfies the validator's test assertion (the validator must block the input) without also triggering GitHub's push protection. Any replacement with a fixture that matches the validator's regex is immediately re-flagged; any replacement with a fixture that does not match the validator's regex breaks the test assertion. The cleanest resolution is to remove the Google fixture from the static parametrize list entirely and restore Google API key format coverage via a runtime-generated fixture that constructs a 35-charAIza-prefixed string at test time and never lands as a literal in source code. Tracked as a parallel-tracks item. The output validator's regex is NOT weakened; the test loses one of seven parametrize cases but continues to verify OpenAI, Anthropic, AWS, JWT, and env-var-assignment detection.Round-2 filter-repo. Ran
git filter-repo --replace-text <file> --forcewith the pattern file containingregex:AIza[A-Za-z0-9_\-]{35}==>AIzaFIXTUREREDACTED. This replaced the Google API key format anywhere it appeared in any historical blob across the entire repository. Every commit from8ebe3964af7dforward was rewritten, which cascaded through the full post-round-1 history including all round-1-remapped SHAs and tonight's 5 commits. Total commits processed: 186. filter-repo's internal commit-map wrote 152 changed entries and 35 unchanged entries (commits before8ebe3964af7dthat never touched the pattern).Working-tree fixture removal. After the filter-repo rewrite,
tests/test_output_validator.pyline 152 read"google says AIzaFIXTUREREDACTED"(15 chars afterAIza, below the validator's 35-char regex threshold). Removed the line entirely from the parametrize list and added a block comment explaining the removal, the regex-collision reason, the parallel-tracks item to restore via runtime-generated fixture, and an explicit note that the validator's regex remains unchanged. Committed as a separate new commit on top of the rewritten history.Round-2 verification sweep. Re-ran the same six-check sweep:
git log,git rev-list --all --objects, reflog, fsck, stash, and a precise regex scan across all blobs for the\bAIza[0-9A-Za-z_\-]{35}\bpattern. Zero blobs in the post-round-2 object database contain a 35-charAIzapattern. The scrub is complete across all history.Round-2 DECISIONS.md SHA remap. The round-1 remap table above uses SHAs
213da36,125dac0,5c1f49f,4454894as the "NEW (post-rewrite)" column. These are the post-round-2 SHAs; they weree6d9675,c1d8163,740c9d5,6d177baafter round 1 and got rewritten again by round 2. To avoid a three-column mapping table showing intermediate round-1 SHAs, the table above reads as a direct pre-rewrite → current-state mapping. The round-1-only intermediate SHAs are preserved in this narrative as "round-1 SHAs" for audit completeness but are not the canonical SHAs anyone looking up a commit should use. The canonical SHAs are the post-round-2 values.Additional round-2 SHA update: parallel-tracks item #7 (Fix 2 revert commit SHA missing from the Fix 2 outcome entry) was updated from
8c836f5(post-round-1) to27c2e17(post-round-2).
Exposure scope, round 2. The branch had been pushed to origin
exactly once before round-2 was discovered (the first push at the
end of round 1, which landed commit 3167b59 at origin). The
feature branch was the only affected ref — main was not updated,
and no PR had been merged. The round-2 cleanup requires a
force-push with --force-with-lease to overwrite the pushed
round-1 history with the round-2 history. Force-push is normally a
discipline concern, but here it is safe: the branch was published
less than one hour before round-2 was discovered, no other work
was based on the pushed round-1 history, and the force-push is
scoped to this specific branch (not main or any long-lived ref).
Alert dismissal. GitHub alert #1 was dismissed as
false_positive via gh api after the force-push, with the
resolution comment noting that the pre-round-2 commit SHA the
alert referenced (8ebe3964af7d) no longer exists in the
rewritten history and the test fixture has been removed from
tests/test_output_validator.py pending a runtime-generated
replacement.
Round-2 procedural lesson. The validator-regex ↔ detector-regex
identity is a structural finding worth noting for future security
test design. Any test fixture that verifies detection of a
specific secret format will, by construction, match the format
it is testing. If the format is one GitHub (or any upstream
detector) also scans for, the fixture will trigger an alert on
every push where it is introduced. The three durable mitigations
are: (a) generate fixtures at runtime so they never land in source,
(b) use an isolated regex that is a proper subset of the production
detector's regex so fixtures fall below the detector's match
threshold, or (c) mark the file explicitly in a
.github/secret-scanning.yml allowlist. This project is adopting
option (a) as the follow-up, because it preserves the production
detector regex without weakening and keeps the test's fidelity to
the actual attack surface.
Audit-path bug — streams masked a request-crashing failure (2026-04-15)
During v1 deploy smoke testing, every non-stream POST /ask request
and every injection-blocked request on the HF Space returned
{"detail": "Internal server error"} (HTTP 500) instead of the
intended 200 / 403. Normal queries via POST /ask/stream appeared to
work correctly from the dashboard, so the bug was invisible until a
direct curl hit a non-stream endpoint. Discovered and fixed same-day.
Root cause. agent_bench/security/audit_logger.py:60 called
self.path.parent.mkdir(parents=True, exist_ok=True) the first time
a request wrote an audit record. The default path logs/audit.jsonl
resolved to /home/user/app/logs/audit.jsonl at runtime. The
Dockerfile's WORKDIR /home/user/app creates that directory as
root:root mode 0755 because Docker's WORKDIR directive does not
honor --chown. Subsequent COPY --chown=user lines only change
ownership of the copied files, not the directory itself. At runtime
under USER user (uid 1000), the process had r-x on WORKDIR — it
could read and execute, but not create new subdirectories. First
audit write → PermissionError: [Errno 13] Permission denied: 'logs'.
Why .cache/ worked and logs/ didn't. .cache/ is created at
build time by the two RUN python scripts/ingest.py ... steps
(running as root) and then explicitly chowned by
RUN chown -R user:user .cache/. logs/ was created lazily at
runtime by the audit logger, as a non-root user, in a directory owned
by root. Directory-creation permission wall.
Why the streaming endpoint masked it. /ask/stream calls
_write_audit at the end of the event generator
(routes.py:438), after all stage events and the final answer have
already been yielded over SSE. A failure there is already too late
to affect the client's view of the response — the client sees the
answer, the stream ends, and the audit entry is silently missing.
Non-stream /ask calls _write_audit synchronously before return
(line 263), so the failure propagates up through the middleware
exception handler and becomes a visible 500. Injection-blocked
requests on both endpoints also audit synchronously before returning
(lines 193, 302) and produce the same visible 500. Normal dashboard
use hits /ask/stream → symptom invisible; smoke testing
non-stream /ask → symptom obvious.
Traceback confirmed from the HF Space runtime log:
File "/home/user/app/agent_bench/serving/routes.py", line 302, in ask_stream
_write_audit(
File "/home/user/app/agent_bench/serving/routes.py", line 558, in _write_audit
audit_logger.log(record)
File "/home/user/app/agent_bench/security/audit_logger.py", line 60, in log
self.path.parent.mkdir(parents=True, exist_ok=True)
PermissionError: [Errno 13] Permission denied: 'logs'
Two-fix patch. Both applied in the same deploy cycle:
- Dockerfile (proximate). Commit
ca34ccb:RUN mkdir -p logs && chown -R user:user .cache/ logs/at build time, beforeUSER user. Ensures the runtime user owns the audit-log directory from container start. AuditLogger.log()(architectural, defense-in-depth). Commit25e0f1b: wrap the entire write body intry/except Exception, log the failure via structlog asaudit_write_failedwith the exception type and path, return normally.
Why both, not either alone.
- Only fix 1: unblocks the Space but leaves the architectural bug — an audit logger that can crash the app is misdesigned regardless of filesystem permissions. The SECURITY.md OWASP LLM10 writeup would be structurally weak.
- Only fix 2: stops the crashes, but audit writes would still
silently fail on HF Spaces because
logs/remains unwritable. The OWASP LLM10 claim would be live-untrue on the demo surface. - Both: demo is honest (audit log actually written on HF Space), architecture is sound (audit failures degrade gracefully), and the OWASP LLM10 writeup has live evidence backing both claims.
Verification on the live Space after the patch. GET /health
flipped from "degraded" to "healthy" with provider_available: true; POST /ask normal returned 200 with grounded answers citing
real sources from both corpora (was 500); POST /ask with the
ignore_previous injection pattern returned 403
"Request blocked: potential prompt injection detected" (was 500);
the runtime log shows injection_detected pattern=ignore_previous tier=heuristic followed by a 403 status and no audit_write_failed
entries, confirming the audit write succeeded end-to-end.
Methodology lesson — streaming endpoints can hide backend failures. Any request-path side effect (audit logging, metrics, cache write, persistence) that runs at the end of a streaming generator will fail silently from the client's perspective, because content is already committed to the wire before the failure fires. Testing such side effects against their failure modes through a streaming happy path is not sufficient — the happy path masks the failure. Every request-path side effect needs at least one test that verifies it fails loudly (or, where the design says graceful, degrades loudly) when its underlying dependency is broken, exercised synchronously, not via the streaming code path.
For this project, AuditLogger.log() should gain a unit test that
asserts it returns normally (not raises) when its target directory is
unwritable — i.e., a regression test for the try/except fix above.
And the dependency injection of the audit logger in the route
handlers should get an integration test that exercises both the
streaming and non-streaming paths against the same failure injection,
to assert the streaming path does not hide what the non-streaming
path surfaces. Tracked as a v1.1 test-coverage follow-up.
Why this wasn't caught locally. Existing
tests/test_audit_logger.py tests use pytest's tmp_path fixture,
which is always writable by the test process. The failure mode only
manifests when the runtime user cannot create the log directory, a
deployment-environment condition not reproducible in a unit test
against a local filesystem owned by the developer. An integration
test that runs the Docker image end-to-end and hits /ask against
the actual container filesystem would have caught it pre-deploy.
Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
build infrastructure) but is the right long-term mitigation for this
class of bug.
LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call
The continuous-score single-call judges in agent_bench/evaluation/metrics.py
(answer_faithfulness, answer_correctness, _judge_call) are deleted
and replaced by the per-dimension Judge layer at
agent_bench/evaluation/judges/. Hard cut, no deprecation cycle.
Design doc: docs/plans/2026-05-04-judge-layer-v1-design.md.
Why this is a supersession, not a refactor. The new layer differs from
the old on six axes: discrete-anchored scale (vs continuous 0–1),
reasoning-before-score JSON ordering (vs score-first), per-dimension
judges (vs combined faithfulness/correctness), full provenance per call
(judge_id + rubric_version + system_output_hash + prompt_seed; old had
none), composable variance wrappers (rubric_permute, jury — old was
single-call), and an intentional abstain-vs-raise discipline (vs silent
None from a bare except Exception).
Evidence backing the supersession claim — the calibration κ table quantifies the new layer's agreement with hand-labels across 6 ablation rows (baseline + 3 variance ablations + permute + 2-judge jury). The files defending this entry's claim, by file path:
measurements/2026-05-04-judge-calibration-labels.jsonl— 30 items × 3 dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the literature ceiling). Lands in Phase 10.results/calibration_v1_judge_baseline.json,_baseline_no_cot.json,_baseline_no_anchors.json,_baseline_no_abstain.json,_permute.json,_jury_kappa_weighted.json— per-row predictions. Land in Phase 11.docs/_generated/kappa_table.md— generated κ ablation table copy- pasted into the writeup. Lands in Phase 11.docs/judge-design.md— interpretive writeup with the closing "when NOT to use LLM-judge" position. Lands in Phase 12.
Config-knob preservation. evaluation.judge_provider is unchanged
across all 5 YAML configs; new evaluation.judge_dimensions field
defaults to the three v1 dimensions. Zero user-facing config migration.
Out of scope (v1.1+). Mistral self-hosted as the third jury member, Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2 prompt optimization, citation_faithfulness in the default judge_dimensions, AC2 sympy-derived parity tests.
Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran — 2026-05-05
The Opus stress-test pass over the 30 calibration items × 3 dimensions
disagreed with the single-rater human gold on 22 of 30 groundedness
items (8/30 agreement). Relevance and completeness agreed at 28/30 and
25/30 respectively. The groundedness disagreement is consistent in
direction — every disagreed-on item is human=1, opus=0 — and has a
single root cause.
Root cause: reference-scope drift between rubric author and labeler.
agent_bench/evaluation/rubrics/groundedness.md defines the reference
scope as the gold snippets attached to each item:
The judge sees only the gold snippets — not the retrieved chunks. A claim that happens to be true in the world but is not entailed by the snippets fails groundedness.
The single-rater notes on the disagreed-on items describe checking
against the broader documentation, not against source_snippets:
"supported by the corpus", "supported by the docs", "supported by the
provided dependency snippet". For items like k8s_006 the gold snippet
is one sentence ("A ConfigMap is an API object used to store
non-confidential data in key-value pairs"), while the agent's answer
correctly synthesizes seven or eight additional claims from the full
k8s_configmap.md. Those claims are true in the world and well-supported
by the full doc, but not entailed by the one snippet. Opus applied
the strict-snippet rubric; the human rater applied a corpus-supported
rubric.
Why this blocks make calibrate against the current gold. The κ
ablation compares Haiku and GPT-4o-mini judges against the human gold.
A judge that correctly applies the strict-snippet rubric will disagree
with miscalibrated gold; a judge that's too lenient will agree. The
ablation rewards leniency and punishes rigor — the opposite of the
intended measurement. This is the same failure mode codified earlier in
this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune
compensation when the measurement is wrong, not the intended effect.
Why the rubric stays as written, not relaxed to "corpus-supported".
Strict-snippet groundedness measures RAG behavior: did the agent
synthesize from what it retrieved? Corpus-supported groundedness
measures LLM general knowledge passing through a RAG harness: did the
agent happen to be correct? The first is what this benchmark is for;
the second is what agent_bench/evaluation/metrics.py measured before
supersession. Relaxing the rubric to "corpus-supported" would silently
re-introduce the failure mode the supersession entry above just removed.
Decision — three-step correction lands before make calibrate runs:
- Rubric clarification commit on
agent_bench/evaluation/rubrics/groundedness.md. Add an explicit reference-scope line and one anchored example contrasting "supported by the snippet" vs "true in the world but not in the snippet". Audit-trail requirement: the v1.1 writeup will cite "rubric clarified between v1.0 and v1.1", and the git history needs to back that claim. - Re-label the 22 disagreed-on groundedness items in
measurements/2026-05-04-judge-calibration-labels.jsonlagainst the clarified rubric, snippet-only. Do not mechanically copy Opus's labels. The labels remain the human single-rater's; what changes is the rubric being applied. Mechanical copy would turn the κ table into "judge vs Opus", which is not what the writeup claims it measures. - Recompute
make calibrateagainst the corrected gold and emitdocs/_generated/kappa_table.mdfrom the v1.1 labels.
Evidence files for the v1.1 writeup section:
measurements/2026-05-05-judge-rubric-opus-stress.jsonl— 90 Opus labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min org rate limit, ~$0.20, ~14 min wall, zero infra-abstains).measurements/2026-05-04-judge-calibration-labels.jsonl— original v1.0 single-rater gold; will be diffed against v1.1 corrected gold to quantify the re-label delta.agent_bench/evaluation/rubrics/groundedness.md— pre/post diff is the rubric clarification.
Pre-labeling observations also worth recording for the writeup methodology section:
q021(fastapi · calculation) answered the CORS preflight question correctly (600 / 60 = 10 minutes) withsources: []andranked_sources: []— the agent did the arithmetic without retrieval and emitted an answer consistent with the snippet without having retrieved it. Methodologically interesting for the citation-faithfulness story (Block 2.7) if it ships: an answer can be correct without being grounded-by-citation.q025(fastapi · multi_hop) answer was truncated mid-token by the orchestrator's max_tokens limit. The labels reflect what the system produced, not a mentally-patched complete version. The completeness rubric does not currently anchor "truncated response" as a level — v1.1 rubric work should add an anchor.- Several K8s items embed external knowledge that's correct but not in
the snippet phrasing (
k8s_017mentions exit-code-0 for init-container success;k8s_009describes Roles vs ClusterRoles by their semantics). The clarified groundedness rubric should pick strict on this case (claim must be supportable by the retrieved spans, not just consistent with them) and the anchored example should show that ruling.
Methodology framing for the writeup. The Opus stress-test was added specifically to catch hand-labeled-gold fragility before the κ table is published. It caught it. The writeup's calibration section should disclose the rubric clarification, quantify the re-label delta on groundedness, and report κ against the v1.1 corrected gold — that is a more credible story than a first-try clean κ table would have been.
Outcome — 2026-05-05 calibrate run on v1.1 gold. All 6 ablation rows
ran cleanly after three coupled production-code fixes that landed on the
same branch as the rubric clarification: (1) markdown fence stripping in
agent_bench/evaluation/judges/base.py::_strip_markdown_fence because
Haiku 4.5 wraps JSON output in ```json ... ```, (2) max_tokens
512 → 1024 because v1.1 anchored examples elicit longer model reasoning,
(3) calibration runner v1.0 omitted item_id from prediction records;
fixed in v1.1 with backfill of the 6 already-written row files via
hash → item_id map (no re-spend). Probe-one-cell-before-sweep saved a
fourth $0.50 wasted run after the fence-strip change — the methodology
note in feedback_judge_probe_before_sweep.md was earned by this
session's two failed full-row attempts that paid ~$1.15 for unparseable
output before the diagnosis converged.
The κ table at docs/_generated/kappa_table.md (regenerated on
2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for
completeness — see report.py _DIM_METRIC) shows three findings
that the writeup interprets rather than reports verbatim:
v1.1 finding 1 — relevance is not "judges fail" territory. Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the 29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00. AC1 is the load-bearing statistic on relevance and groundedness; both metrics agree on completeness where the gold (23×2 / 5×1) is balanced.
v1.1 finding 2 — no_cot completeness agreement is real, not
selective abstain. AC1 = κ = 1.000 at n=24. The 2 absent cells
(q021, k8s_012) are infrastructure abstains (provider rate-limit
retry exhaustion), both gold=2, neither in baseline's disagreement
set. On the 24 scored cells, all 4 baseline-with-CoT disagreements
(3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to
agreement when CoT is removed. The interview-relevant claim is the
opposite of the conventional CoT-helps story: CoT-before-score on
3-point completeness lets the judge over-emphasize partial coverage
and rationalize 1 when the human gold sides with the holistic
"covers the points" reading.
v1.1 finding 3 — jury_kappa_weighted underperformed baseline on
completeness, with a precise mechanism. Per-member analysis from
results/calibration_v1_judge_jury_kappa_weighted_members.jsonl:
Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%;
gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw
26.9% — systematically harsh on the 3-point scale, almost never
scoring 2. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9% —
matches gpt-4o-mini alone exactly because the jury verdict reduces
to gpt-4o-mini's verdict on every disputed cell.
The mechanism is missing-weight + round-down compounding, not
weighted voting in the usual sense. scripts/run_calibration.py ::_load_weights_from_baseline is a documented v1 stub that returns
weight = 1.0 for every judge_id present in baseline. baseline.json
contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini
gets 1.0 from jury.py's missing-key fallback (with a logged
jury_missing_weight_fallback_to_one warning per call). Equal
weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean
of 1.5; the _discretize_mean rule is frac > 0.5 → ceil else floor,
and 0.5 > 0.5 is false, so 1.5 floors to 1. gpt-4o-mini's verdict
wins every disputed cell. The v1 design doc's risks subsection listed
"jury κ worse than the better individual judge — (a) kappa-weighting
wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired
both branches simultaneously: branch (a) because the weighting is a
stub returning equal weights, and branch (b) because round-down at
exact 0.5 ties hands the verdict to the lower-scoring member.
The deeper structural point is that weighting alone cannot rescue a systematically miscalibrated member. Even held-out validation that correctly assigned gpt-4o-mini's true low weight on completeness would still let it dominate disputed ties unless its weight were driven near zero — and at that point exclusion is more honest than near-zero inclusion. The conservative-on-binary "ties to lower" rule also doesn't transfer cleanly to ordinal scales: on completeness, "conservative" means scoring toward incomplete, which is precisely the direction of gpt-4o-mini's bias.
v1.2 fix list (four items, expanding the earlier two-item list):
- Held-out jury weights. Replace the
_load_weights_from_baselinestub with a real κ-derived computation, evaluated on a held-out validation set — not the same calibration row whose κ is being measured against the gold. Closes the circular-weighting hole. - Symmetric member coverage in the weights source. Missing-member
fallback to weight = 1.0 amplifies an unweighted member rather than
suppressing it. Either every jury member must have a weight in the
source file or the run must abort. The
jury_missing_weight_ fallback_to_onewarning fired loudly on every call this run; in v1.2 it should be a hard error. - Per-dimension member exclusion when individual κ falls below a threshold. gpt-4o-mini at κ = 0.020 on completeness should not be in the completeness jury at all. Weights below a floor (suggested κ < 0.2) should be treated as exclusion, not as small-weight inclusion. Held-out validation fixes circular weighting; it does not fix systematic member bias.
- Per-dimension tie-break rule. v1's
_discretize_meanrule (ties to lower) was selected for conservative behavior on binary scales, where "conservative" means scoring 0 on uncertainty. On 3-point completeness, "conservative" means scoring toward incomplete, which interacts badly with member miscalibration. v1.2 should select the tie-break rule per-dimension based on the rubric's conservative direction, not globally.
Evidence files: docs/_generated/kappa_table.md (regenerated with
AC1 for groundedness/relevance, κ for completeness);
results/calibration_v1_judge_jury_kappa_weighted_members.jsonl
(per-member sidecar where the gpt-4o-mini completeness bias is
visible per item); results/calibration_v1_judge_baseline.json
(weights source — note the absence of any gpt-4o-mini-2024-07-18
entries, which is why the missing-weight fallback fires).
v1.1 jury rescue — sharpened diagnostic + pre-committed A+B success criteria
Date: 2026-05-06. Status: in-flight; this entry is the pre-experiment contract that pins down what counts as success before the re-aggregation runs, so the outcome can't be negotiated post-hoc.
Sharpened diagnostic — extraction-vs-reasoning split, not just "model is
biased". Re-reading the per-member sidecar (item-level, not aggregate)
on the gpt-4o-mini completeness disputes shows a more specific failure
mode than "harsh on 3-point". On the three representative gold=2 / Haiku=2
/ gpt=1 cases (q006, k8s_002, k8s_018), gpt-4o-mini's evidence_quotes
field correctly extracts the paraphrased coverage from the agent answer
— and then its reasoning field denies that those very quotes constitute
coverage. k8s_002 is the cleanest instance: the model quotes the strings
"declarative updates" and "sticky identity" into evidence, then writes
"the answer does not explicitly mention 'declarative updates' and 'sticky
identity'". The score follows the reasoning, not the evidence. The
mechanism is that the model's post-extraction reasoning step applies a
literal-string-match standard to the answer text while the rubric
requires "paraphrase allowed" — i.e., the structured-output discipline
forced an extraction step that the reasoning step then contradicted on
autopilot. This is a known failure mode in chain-of-thought judges and
shows up more in smaller models because the reasoning step has less
capacity to integrate the rubric's instruction with the literal-text
comparison the model is running by default. The artifact for the writeup
is measurements/2026-05-06-gpt4o-extraction-reasoning-split.md (three
side-by-side reasoning + evidence_quotes excerpts).
Pragmatic v1.1 weights-source decision. The v1.2 fix-list above specifies a held-out validation set for jury weights — methodologically clean but requires either splitting N=30 (loses statistical power on both halves) or labeling more items (eats interview prep time). v1.1 chooses pragmatic: weights computed from the same calibration set used for κ reporting, with the circularity flagged in the writeup. Reason: (a) the alternative is splitting N=30, (b) the per-member κ values used as weights are internally consistent, (c) v1.2 will use a held-out 20- item set. The writeup will contain a sentence acknowledging the circularity rather than hiding it.
v1.1 elevated fix-list (subset of the v1.2 list above). Items 2 (symmetric coverage / hard-error) is elevated unconditionally. Item 1 (real κ-derived weights) is elevated in pragmatic form (same set with circularity caveat). Items 3 (per-dimension exclusion) and 4 (per- dimension tie-break) remain v1.2 unless B's outcome forces them up.
Pre-committed B success criteria. Plan B is "re-aggregate the existing
164 member-rows in calibration_v1_judge_jury_kappa_weighted_members.jsonl
with corrected κ-derived weights, no new API spend." The outcome maps
deterministically to one of three predefined responses, picked before
B runs:
- Outcome 1 — jury κ on completeness exceeds Haiku-baseline κ by ≥ 0.05 (i.e., new jury κ ≥ 0.466, vs Haiku-alone 0.416). Writeup story: "v1's weights-source bug masked correct aggregation; once both bugs (asymmetric coverage + missing-weight fallback) are fixed, the jury improves on baseline. Per-dimension exclusion remains a v1.2 design pattern but is not needed at v1.1." This is the strong story.
- Outcome 2 — jury κ within ±0.05 of Haiku-baseline (i.e., 0.366 ≤ jury κ ≤ 0.466). Writeup story: "weights-source fix recovers parity but the jury isn't doing meaningful work on completeness — gpt-4o- mini's near-zero weight makes it effectively excluded by aggregation. This is soft exclusion via weighting; v1.2 will make exclusion explicit." Defensible but less clean.
- Outcome 3 — jury κ falls below Haiku-baseline κ by >0.05 (i.e., jury κ < 0.366). Writeup story: "weights-source fix is necessary but not sufficient; even at near-zero weight gpt-4o-mini's verdict tips disputed (1, 2) ties due to the round-down rule. v1.1 escalates to per-dimension exclusion." Item 3 of the v1.2 fix-list moves into v1.1.
Why the predefined-criteria framing matters. "I ran B, looked at the number, decided it was good enough" is the same data with a weaker frame than "I predefined the success criteria before running the experiment, B landed at outcome X, which mapped to predefined response Y". The latter demonstrates evaluation maturity in the writeup; the former invites post-hoc reading of the outcome.
B outcome — 2026-05-06. Plan B re-aggregated the existing 164 sidecar rows with κ-derived weights (Haiku=0.416, gpt-4o-mini=0.020 on completeness; clipped at 0 from raw κ values). Result: jury κ on completeness = 0.416, exactly matching Haiku-baseline. Δ = 0.000; maps to Outcome 2 (soft exclusion via weighting). Per the pre-committed response, v1.1 stops here and writes up; per-dimension member exclusion (item C / v1.2 fix #3) is not escalated to v1.1.
Mechanism, validated empirically — a disputed cell (Haiku=2, gpt=1)
with corrected weights aggregates as (2 × 0.416 + 1 × 0.020) / 0.436 = 1.954. The frac (0.954) > 0.5 round-up rule ceils to 2, giving the
correct verdict. v1's two compounding bugs (asymmetric source returning
weight=1.0 for Haiku and the missing-key fallback returning 1.0 for gpt-
4o-mini) jointly forced equal weights, and equal-weights with the same
round-up rule produced (2 × 1 + 1 × 1) / 2 = 1.5, which has frac
exactly 0.5 (not > 0.5), and floored to 1 — gpt's verdict winning every
disputed cell. The bug fixes recover the right verdict purely
mechanically; no judge model behavior changes.
The empirical reading: the weighting is not doing meaningful work — gpt-4o-mini's near-zero weight effectively excludes it on completeness, and the jury's κ matches Haiku-alone exactly because Haiku's verdict wins every disputed cell. This is "soft exclusion via weighting"; v1.2's explicit per-dimension exclusion (item 3 of the v1.2 fix-list) makes the exclusion visible in the jury config rather than emergent from κ-derived weight collapse.
v1.1 code changes (this commit):
agent_bench/evaluation/variance/jury.py— silent missing-weight fallback to 1.0 → hardValueError. Two existing tests that asserted the old contract (test_kappa_weighted_reasoning_reports_applied_weights_not_dict,test_kappa_weighted_logs_warning_on_missing_weight) updated to assert the new contract.scripts/run_calibration.py::_load_weights_from_baseline→_compute_kappa_weights— replaces the v1 stub with real per-judge Cohen's κ on the dimension; hard-errors when any expected member is missing from the source. Clips κ < 0 to weight = 0 (soft exclusion).configs/calibration/rows/jury_kappa_weighted.yaml—weights_sourcere-pointed fromcalibration_v1_judge_baseline.json(Haiku-only, asymmetric coverage) tocalibration_v1_judge_jury_kappa_weighted_members.jsonl(both judges, same calibration set with documented circularity).tests/scripts/test_run_calibration_dispatch.py— two new tests cover_compute_kappa_weights: (a) computes real κ (high-agreement judge → weight=1.0, chance-agreement judge → 0); (b) hard-errors on asymmetric source coverage.results/calibration_v1_judge_jury_kappa_weighted_v1_1.json— new predictions row produced by re-aggregating the existing sidecar offline (no API spend; viascripts/_dev/reaggregate_jury_v1_1.py).docs/_generated/kappa_table.mdregenerated with this row alongside the broken v1 row, giving the writeup a clean before/after diff (completeness: 0.014 → 0.416, n=26).measurements/2026-05-06-gpt4o-extraction-reasoning-split.md— the three side-by-side reasoning + evidence_quotes excerpts (q006 / k8s_002 / k8s_018) demonstrating the extraction-vs-reasoning split diagnostic finding.
The v1.2 fix-list above is unchanged in scope; v1.1 elevates items 1 (pragmatic form) and 2 (full form). Items 3 and 4 remain v1.2.
Plan 3A — recency-positioned paraphrase instruction (pre-committed criteria)
Date: 2026-05-06. Status: in-flight; this entry pins down the hypothesis and success criteria before the experiment runs.
Hypothesis sharpened by the 1A direction-of-bias finding. GPT-4o- mini's completeness disagreements are 17/19 gold=2/pred=1 with zero up-mistakes across 26 items spanning two corpora — direction-aware noise, not balanced random labeling. The model is consistently applying some rule stricter than the rubric requires. The hypothesis under test: that stricter rule is "literal-string match required, paraphrase doesn't count," and the bias is fixable by recency-positioning the rubric's "paraphrase allowed" instruction adjacent to the commit-to-score decision instead of leaving it 500+ tokens upstream in the rubric body.
The intervention is positional, not lexical. The current
CompletenessJudge prompt (agent_bench/evaluation/judges/completeness.py)
sends the rubric body, then the gold reference, then the system answer,
then a one-line "Score this answer..." instruction immediately followed
by the JSON schema clause. The rubric body's "paraphrase allowed" clause
appears in the introductory paragraphs, hundreds of tokens before the
score decision. The intervention adds one sentence between the system
answer and the score instruction:
"Note: a paraphrase that captures the same meaning as a gold-answer point counts as covered. Score on content equivalence, not surface form."
This is the recency-positioning hypothesis: the model loses the paraphrase conditioning across the rubric anchors and the reasoning step. Restating the instruction adjacent to the score decision tests whether the bias is positionally correctable.
Selected 5 disputed items (representative of the gold=2 / Haiku=2 /
gpt=1 pattern across both corpora): q006, q011, k8s_002, k8s_006,
k8s_018. All four are pure paraphrase-coverage cases (the system
answer paraphrases the gold's points; Haiku scored 2; GPT-4o-mini scored
1 with the extraction-vs-reasoning split documented in
measurements/2026-05-06-gpt4o-extraction-reasoning-split.md).
Pre-committed 3A success criteria.
- Fixed (≥3/5 shift from 1 → 2): Recency-positioning is sufficient. Re-run GPT-4o-mini on the full 26 disputed items with the corrected prompt, recompute κ, update the writeup table. Story: "rubric- engineering matters more than judge model choice for ordinal scales — recency-positioning the paraphrase instruction recovered N% of disputed items." The completeness story becomes actionable, not diagnostic-only.
- Partially fixed (1–2/5 shift): Inconclusive at N=5 (binomial-
significance line is
3+). Re-run on the full 26 disputed items ($0.20) to get a clean number; write up whatever the full-26 says. - Not fixed (0/5 shift): The instruction is being received and ignored — the model can't act on it under reasoning load. Escalate to 4A (GPT-4o full on the same 5 items) to verify the small-model- specific claim. Story: "repositioning the paraphrase instruction adjacent to the score decision did not shift any of 5 disputed items; GPT-4o handled the same prompts. The bias is small-model-specific, not prompt-fixable."
The 3/5 threshold is the binomial-significance line at this N — random shifting under the null produces 0 or 1 changes most of the time. Pre- committing avoids the "2 shifted, that's kind of a fix" negotiation.
On the 1A relevance finding — confirmed. Both judges essentially correct on every relevance item (Haiku 29/30, GPT-4o-mini 30/30); κ degeneracy is structural under 29/30 prevalence at class-2; AC1 + raw agreement is the right reporting. No further investigation on relevance. Writeup paragraph is one short sentence: prevalence-induced degeneracy → AC1 is load-bearing.
Plan 3A — outcome on the 5-item probe + full-26 re-run (v1.1.1)
Date: 2026-05-06. Status: complete; the v1.1.1 prompt is now
permanent in agent_bench/evaluation/judges/completeness.py.
3A 5-item probe: 3/5 disputed items shifted 1 → 2 (q006, q011, k8s_002), 2/5 unchanged (k8s_006, k8s_018). Cost $0.0013. At pre- committed threshold (≥3/5 → "fixed"), so the protocol triggered the full-26 re-run on gpt-4o-mini only (Haiku held as control to make the v1.1 → v1.1.1 delta cleanly attributable to the intervention's effect on the affected judge).
Full-26 re-run (gpt-4o-mini completeness, v1.1.1 prompt):
| n | raw | κ | AC1 | |
|---|---|---|---|---|
| v1.1 gpt-4o-mini | 26 | 26.9% | +0.020 | +0.006 |
| v1.1.1 gpt-4o-mini | 28 | 42.9% | +0.000 | +0.232 |
| v1.1 Haiku (control) | 26 | 84.6% | +0.416 | +0.792 |
Per-item delta (v1.1 → v1.1.1): 7 items shifted up (1 → 2 or 1 → 2), 0 shifted down, 19 unchanged. Of the 7 up-shifts: 6 are correct (gold=2 items moving from pred=1 to pred=2: k8s_002, k8s_013, k8s_015, k8s_016, k8s_017, q006), 1 is a regression (k8s_025: gold=1, was correctly pred=1 in v1.1, now over-credited at pred=2). Net per-item correctness delta: +5 items.
Cohen's κ is misleading on this comparison. v1.1.1 raw agreement
rose from 26.9% to 42.9% (+16 percentage points), and AC1 rose from
0.006 to 0.232 (38× improvement). But Cohen's κ stayed at ~0 — slightly
lower than v1.1's 0.020. The mechanism is prevalence-rebalancing in
the marginals: gpt-4o-mini's pred distribution shifted from {0:2, 1:19, 2:5} (concentrated at 1) to {0:4, 1:12, 2:12} (more balanced, closer
to gold's {1:5, 2:23} over n=28). Cohen's κ = (P_o - P_e)/(1 - P_e);
when marginals become more diverse, P_e (chance agreement) rises in
lockstep with P_o (observed agreement), and κ deflates. AC1 uses
prevalence-robust chance correction (P_e = (1/(q-1)) Σ pi_k(1-pi_k))
and reads the actual signal.
This is the same trap that motivated AC1 over κ on the relevance and groundedness rows of the original κ table, surfacing here at a different distribution boundary. The κ table footer already explains why per-dimension metric selection matters; v1.1.1's outcome demonstrates the trap induced by the intervention itself.
Effect on the jury aggregate. With κ-derived weights and gpt-4o- mini's v1.1.1 κ at 0 (clipped from +0.000 to weight=0), the jury verdict on completeness is now mathematically equivalent to Haiku-alone on every item (gpt's contribution is multiplied by zero). Jury κ stays at 0.416, identical to v1.1's corrected aggregate. The intervention's per-member improvement is invisible at the jury level under this weighting scheme.
Methodological consequence — v1.2 fix-list addition. The v1.2 fix- list now expands by one item:
Prevalence-robust weights for prevalence-skewed dimensions. v1.1's
_compute_kappa_weightsuses Cohen's κ for every dimension, which has a self-defeating property on prevalence-skewed gold: improving a member can lower its weight even as it gets more accurate.Mechanism. Cohen's κ =
(P_o - P_e) / (1 - P_e), whereP_e = Σ_k P(gold=k) × P(pred=k)is the chance-agreement term computed from the marginal distributions. P_e is not invariant to the predictor's marginal distribution — when a member's predictions become more diverse (less concentrated at one class), P_e rises as the marginals approach gold's marginals. Concretely: when an intervention moves a member's pred distribution from concentrated- at-one-class toward gold's distribution, P_o and P_e rise together in lockstep. The numeratorP_o - P_estays small, and κ deflates even as raw accuracy improves. This is the same prevalence-induced degeneracy that motivated AC1 over κ on relevance/groundedness rows in the κ table — it surfaces in jury weighting at any distribution-shifting intervention's boundary.Empirically observed in v1.1.1. The recency-positioning intervention shifted gpt-4o-mini completeness pred dist from
{0:2, 1:19, 2:5}to{0:4, 1:12, 2:12}, closer to gold's{1:5, 2:23}over n=28. Per-cell raw agreement 26.9% → 42.9%. AC1 (Gwet 2008) reads the change correctly: 0.006 → 0.232 (38×). Cohen's κ stays at ~0 (0.020 → 0.000) because P_e is now ≈ P_o ≈ 0.43. v1.1's_compute_kappa_weightsclips the new κ at zero, producing weight = 0 — and the jury aggregate loses access to a member that was empirically improved. The intervention's per- member improvement is invisible at the jury level under κ-weighting.Architectural decomposition for v1.2. The right separation:
- Per-dimension metric for κ table reporting (already in v1.1
via
agent_bench/evaluation/calibration/report.py::_DIM_METRIC). - Per-dimension weight metric for jury aggregation (new in
v1.2, reuses
_DIM_METRIC). Use κ where the gold's prevalence supports it, AC1 where κ degenerates. Same lookup, same per- dimension policy at both reporting and weighting layers. - Per-dimension membership as explicit configuration override for members that are structurally inappropriate (v1.2 fix #3, unchanged) — distinct from "low score on the chosen metric," which is handled by the weight floor.
Why this is non-obvious. A reader's first instinct is that "weight by κ" is a sensible default — κ is the standard inter- rater statistic. The self-defeating property is invisible until you observe a real intervention that shifts marginals; in static conditions (no intervention, fixed prompts), the κ-weight choice is benign. The v1.1.1 outcome is the first time the agent-bench calibration set has produced an intervention-induced marginal shift on the same gold; the failure mode wouldn't have been visible in v1.0's static calibration sweep.
- Per-dimension metric for κ table reporting (already in v1.1
via
v1.1.1 code changes (this commit):
agent_bench/evaluation/judges/completeness.py— addsPARAPHRASE_RECENCY_CLAUSEconstant, inserted between the system answer and the score instruction. Comment cites the 3A probe.tests/evaluation/test_judges.py::TestCompletenessJudge::test_reference_answer_in_prompt— extends to assert the recency clause appears AND is positioned between the answer and the score instruction (position is load- bearing, not just lexical inclusion).results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl— merged sidecar: v1.1 groundedness/relevance rows (unchanged judges) + fresh v1.1.1 gpt-4o-mini completeness rows + v1.1 Haiku completeness rows.measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl— the 5-item probe artifact with reasoning + evidence_quotes for each.scripts/_dev/probe_3a_paraphrase_recency.py,scripts/_dev/rerun_completeness_v1_1_1.py— reproducers; not part of the production calibration runner.
No changes to the κ table. The jury aggregate κ on completeness is
unchanged (0.416 → 0.416) because of the κ-as-weight degeneracy
described above; adding a jury_kappa_weighted_v1_1_1 row with
identical numbers would be visual noise. The v1.1.1 finding lives in
the writeup body, not the table — the per-member AC1 improvement
(0.006 → 0.232) is the headline number, surfaced as a separate
paragraph next to the κ table rather than inside it.
Total spend through Plan 3A: $0.0013 (3A probe) + $0.0075 (full-26 re-run) = $0.0088.
Plan 4A — GPT-4o (full) on the v1.1.1 residual
Date: 2026-05-06. Status: complete. Run after the writeup- framing review surfaced that v1.1.1's "fixed" verdict was overclaim- prone — 5/19 items were recovered, 14 remained unchanged and uncharacterized. 4A was originally scoped as conditional on 3A not being fixed (per the predefined sequencing rule), but became valuable as a post-3A diagnostic to characterize the residual: is it small- model-specific or rubric-under-specified?
Scope. GPT-4o (gpt-4o-2024-08-06) on 5 of the 14 v1.1.1-unchanged
items: k8s_006, k8s_018, q011, q012, k8s_001. Same v1.1.1
production prompt (paraphrase recency clause active). The first two
(k8s_006, k8s_018) are the items that didn't shift in the original 3A
5-item probe — we have gpt-4o-mini's reasoning on those items with
the v1.1.1 intervention, so 4A gives a clean A/B at fixed prompt
varying only the model. q011, q012, k8s_001 cover the broader
fastapi/k8s residual surface (k8s_001 also a Haiku miscall — 4A
checks whether GPT-4o agrees with gold or with Haiku).
Result: 5/5 correct. All 5 items scored 2 by GPT-4o, matching gold exactly. Cost: $0.0011 reported (caveat: pricing config falls back to gpt-4o-mini rates for unlisted models, so actual cost is closer to $0.005–0.01 — the reported number under-reports by ~5–10×).
Sharpened mechanism — criteria-invention, not just literal-match.
The original 3-example artifact (q006, k8s_002, k8s_018) was framed
as gpt-4o-mini "applying a literal-string-match standard" while
correctly extracting paraphrased coverage into evidence_quotes. 4A's
side-by-side reasoning on k8s_018 shows a distinct second mechanism:
- gpt-4o-mini (v1.1.1, score 1): "It mentions some key points
from the reference... but does not explicitly state that the new
fields in
autoscaling/v2are preserved as annotations when usingautoscaling/v1, nor does it mention the need to useautoscaling/v2directly for memory or custom metric scaling for a Deployment or StatefulSet." - gpt-4o (4A, score 2): "The answer covers all the key points from the reference. It mentions that the current stable version is autoscaling/v2, which supports scaling on memory and custom metrics, similar to the reference. It also notes that autoscaling/v1 only supports CPU-based scaling, aligning with the reference's points."
The reference for k8s_018 specifies three points: (1) autoscaling/v2 is the current stable API, (2) it adds memory metrics support beyond v1's CPU-only, (3) it adds custom metrics support. gpt-4o-mini's reasoning step invents additional criteria the reference does not require ("preserved as annotations when using autoscaling/v1," "use autoscaling/v2 directly for ... a Deployment or StatefulSet") and then deducts against them, scoring 1. GPT-4o reads the reference's three points and scores against exactly those, scoring 2.
This is a capacity finding distinct from the paraphrase-recency finding: gpt-4o-mini's reasoning, even with the v1.1.1 prompt directing it toward paraphrase semantics, manufactures additional gold criteria during scoring that aren't in the reference. Recency-positioning the "paraphrase allowed" clause doesn't address this — the bias isn't "missed paraphrase," it's "invented extra requirements." Two failure modes were stacked; v1.1.1 fixed one; the second is what 4A surfaces.
Implication for v1.2. With 5/5 confirmed, v1.2 fix #3 (per- dimension membership) gets clean empirical support: gpt-4o-mini is the wrong tool for 3-point completeness with paraphrase semantics, and no amount of prompt engineering on this rubric is going to bridge the capacity gap. The right v1.2 path is one of:
- Exclude gpt-4o-mini from completeness scoring (per-dim membership; jury reduces to single-judge Haiku on completeness; explicit and visible in config).
- Replace gpt-4o-mini with GPT-4o on completeness (per-dim judge selection; jury keeps two members but the second is a frontier-class model on the dimension that needs it).
Both are defensible v1.2 designs. The choice depends on cost
budget — gpt-4o is 10× the per-call cost of gpt-4o-mini. For
agent-bench's calibration set scale (30 items × per-row), even gpt-
4o is trivially cheap; for production deployment evaluating thousands
of agent outputs, the cost trade-off matters more.
4A artifact: measurements/2026-05-06-4a-gpt4o-full-probe.jsonl
(per-item reasoning + evidence_quotes for the 5 GPT-4o calls; pairs
with the v1.1 sidecar's gpt-4o-mini reasoning on the same items for
the side-by-side analysis above).
Updated honest framing for the writeup. "v1.1.1 addressed one identified failure mode (paraphrase-instruction-loss across reasoning, recovered 5/19 disputed items via positional change). 4A confirmed the residual 14 are a distinct failure mode (capacity-limited criteria invention during the reasoning step) — GPT-4o handles all 5 sampled residuals at the same v1.1.1 prompt, so the failure is small-model- specific rather than rubric-limited. v1.2 fix #3 (per-dimension judge membership / model selection) is the right escalation; the rubric itself doesn't need changes."
Total session spend: $0.0099 reported (~$0.013–0.018 actual after gpt-4o pricing correction).