| # Design Decisions | |
| ## Why build from primitives, not LangChain? | |
| I wanted to demonstrate I understand tool dispatch, memory management, | |
| and retrieval orchestration at the implementation level. My provider | |
| abstraction is ~150 lines. When reranking or a second provider is needed, | |
| I know exactly where it plugs in, because I built every layer. | |
| ## Why one provider in V1? | |
| The interface supports multiple providers. V1 shipped OpenAI + Mock to | |
| prove the abstraction. V2 added Anthropic (claude-haiku-4-5), confirming | |
| that switching providers is a one-line config change. The orchestrator | |
| and tools are completely unchanged between providers. | |
| ## Why one domain (technical docs)? | |
| Clean text produces clean evaluation. Research papers introduce PDF | |
| parsing noise (tables, figures, formulas) that degrades eval quality | |
| without adding signal. The framework handles any text corpus; the | |
| domain is a config swap. | |
| ## Why Reciprocal Rank Fusion, not score normalization? | |
| BM25 scores and cosine similarities live on different scales. | |
| Normalizing across scales is brittle: min-max normalization is | |
| sensitive to outliers, z-score requires distribution assumptions. | |
| RRF fuses by rank position `1/(k + rank)`, which is robust, | |
| parameter-light (only k=60), and well-studied. Trade-off: loses | |
| magnitude information, but for top-5 retrieval this rarely matters. | |
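As a rough illustration of the rank-based fusion described above, here is a minimal sketch. The function name `rrf_fuse` and the sample file names are hypothetical; only the `1/(k + rank)` formula and `k=60` come from the text.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; each list contributes 1 / (k + rank) per doc.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from the lexical and dense retrievers.
bm25_ranking = ["a.md", "b.md", "c.md"]
dense_ranking = ["b.md", "d.md", "a.md"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Only rank positions enter the score, so the raw BM25 and cosine magnitudes never need to share a scale.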
| ## Why ~16 curated docs, not a large corpus? | |
| Small corpus produces predictable retrieval, reproducible benchmarks, | |
| and easy debugging. Golden dataset questions map to specific source | |
| files. A reviewer can verify any result by reading the source. The | |
| framework scales to larger corpora; the choice is about evaluation | |
| quality, not capability. | |
| ## Why no reranker in V1? | |
| Feature-flagged in config (`rag.reranker.enabled: false`). V1 | |
| benchmarks without reranking establish an honest baseline. V2 adds | |
| cross-encoder reranking and shows the delta. | |
| ## Why no delete in the vector store? | |
| FAISS flat index doesn't support efficient deletion. For a small | |
| corpus that changes rarely, rebuild-on-ingest is simpler and | |
| eliminates consistency bugs. | |
| ## Why async internals, sync user behavior? | |
| FastAPI and the OpenAI SDK are async-native. Using async for I/O | |
| avoids blocking the event loop. V2 added SSE streaming (`/ask/stream`) | |
| for the final synthesis step; tool calls remain non-streamed since | |
| they complete in ~100ms. | |
| ## Why SQLite-backed conversation sessions | |
| V1 was stateless by design: no `conversation_id`, no cross-request | |
| memory. V2 adds optional SQLite-backed sessions: pass `session_id` | |
| on `/ask` to persist and load conversation history. When omitted, | |
| behavior is identical to V1 (stateless). See the dedicated | |
| DECISIONS.md entry under "Why SQLite for conversation persistence" | |
| for the full rationale. | |
| ## Why negative evaluation cases? | |
| A system that always answers sounds confident but may hallucinate. | |
| 5 out-of-scope questions test whether the system refuses gracefully | |
| when the corpus doesn't contain the answer. Grounded refusal | |
| requires both refusal language AND zero cited sources: an answer | |
| that says "not found" but still cites docs is not a valid refusal. | |
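The two-part condition could be checked along these lines. This is a sketch only: the phrase list and the function name `is_grounded_refusal` are assumptions for illustration, not the harness's actual rules.

```python
# Hypothetical refusal phrases; the real harness may use a different list.
REFUSAL_PHRASES = ("not found", "cannot answer", "no relevant documents")


def is_grounded_refusal(answer: str, sources: list[str]) -> bool:
    """True only when the answer refuses AND cites nothing."""
    text = answer.lower()
    has_refusal_language = any(phrase in text for phrase in REFUSAL_PHRASES)
    return has_refusal_language and len(sources) == 0
```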
| ## Why deterministic eval + optional LLM judge? | |
| CI needs free, deterministic tests. Retrieval P@5, keyword hit | |
| rate, citation accuracy, and grounded refusal rate run without | |
| API keys. LLM-judged faithfulness and correctness are manual | |
| enrichment steps, run locally, with results committed to the | |
| benchmark report. | |
| ## Why structlog, not LangSmith/Langfuse? | |
| Third-party observability contradicts the "built from primitives" | |
| narrative. structlog provides JSON-structured logs, `/metrics` | |
| exposes latency + cost. If a team uses LangSmith, adding it is | |
| a one-day task. | |
| ## Why explicit citation format [source: filename.md]? | |
| The system prompt mandates inline citations so the evaluation | |
| harness can parse them with `\[source:\s*(.+?)\]` and check | |
| against the structured sources list. This catches hallucinated | |
| citations and measures citation accuracy as a metric. | |
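A minimal sketch of that parsing step, using the regex quoted above. The helper name `split_citations` and the sample file names are hypothetical.

```python
import re

# Pattern quoted in the text: captures the filename inside [source: ...].
CITATION_RE = re.compile(r"\[source:\s*(.+?)\]")


def split_citations(answer: str, sources: list[str]):
    """Return (valid, hallucinated) citations found in the answer text."""
    cited = [name.strip() for name in CITATION_RE.findall(answer)]
    allowed = set(sources)
    valid = [name for name in cited if name in allowed]
    hallucinated = [name for name in cited if name not in allowed]
    return valid, hallucinated


answer = "Use Depends [source: dependencies.md]. See [source:  made_up.md]."
valid, hallucinated = split_citations(answer, ["dependencies.md", "security.md"])
```

Any name that lands in `hallucinated` is a citation the model wrote without a matching entry in the structured sources list.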
| ## Why per-request retrieval settings via kwargs, not singleton mutation? | |
| The orchestrator is a singleton shared across concurrent requests. | |
| Storing `top_k` / `strategy` on `self` causes cross-request state | |
| bleed. Instead, these are passed as local variables through the | |
| tool execution kwargs, so no shared state is mutated. | |
| ## Why a relevance threshold for grounded refusal | |
| V1 never refuses: it always retrieves tangentially related content and | |
| synthesizes an answer. This is a trust failure: users cannot distinguish | |
| "the system found relevant information" from "the system fabricated from | |
| vaguely related chunks." Grounded refusal rate was 0/5. | |
| We add a refusal gate in `SearchTool.execute()` based on the maximum RRF | |
| score across retrieved chunks. If no chunk scores above the threshold, the | |
| tool returns "No relevant documents found", and the LLM then refuses via the | |
| system prompt rather than fabricating from irrelevant content. | |
| **Gate location:** The gate fires in `SearchTool.execute()`, not the | |
| orchestrator. `SearchTool` is where retrieval scores are still available; | |
| they are dropped before results reach the orchestrator. This also keeps | |
| the orchestrator unchanged. | |
| **Threshold value:** `rag.refusal_threshold: 0.02` is a provisional default | |
| pending an empirical sweep across the evaluation set. The sweep will test | |
| values 0.01–0.03 and select the value that maximizes refusal on out-of-scope | |
| queries without degrading in-scope P@5 and R@5. The actual RRF score | |
| distribution will be documented here after tuning. | |
| **Interaction with reranking:** The refusal gate fires on RRF scores BEFORE | |
| reranking. It is a go/no-go decision, not a per-chunk filter. If the gate | |
| passes, the full candidate set proceeds to the reranker. This keeps the | |
| threshold calibration independent of whether reranking is enabled. | |
| **Default disabled:** `refusal_threshold: 0.0` preserves V1 behavior exactly. | |
| The feature is opt-in until the threshold is tuned. | |
| **Alternative considered:** LLM-based relevance judgment ("is this content | |
| relevant to the query?"). Rejected because it adds latency, cost, and a | |
| second point of failure. The score-based approach is deterministic, fast, | |
| and debuggable. | |
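The go/no-go behaviour described in this entry can be sketched as follows. `ScoredChunk` and `apply_refusal_gate` are hypothetical names; the real logic lives in `SearchTool.execute()` and may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    source: str
    rrf_score: float


def apply_refusal_gate(chunks, threshold):
    """Return all chunks if the gate passes, or None to signal refusal.

    The gate looks only at the maximum RRF score; it never drops
    individual chunks.
    """
    if threshold <= 0.0:
        return chunks  # gate disabled, V1 behaviour
    best = max((chunk.rrf_score for chunk in chunks), default=0.0)
    return chunks if best > threshold else None


weak = [ScoredChunk("x", "a.md", 0.0164), ScoredChunk("y", "b.md", 0.0100)]
strong = [ScoredChunk("x", "a.md", 0.0325), ScoredChunk("y", "b.md", 0.0100)]
```

A caller would map a `None` result to the "No relevant documents found" tool output and pass a non-`None` result on to the reranker unchanged.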
| ## Why cross-encoder reranking improves precision | |
| BM25 retrieves lexically similar but semantically irrelevant chunks. | |
| RRF fusion mitigates this partially, but noisy BM25 results still | |
| dilute the top-5 set. P@5 was 0.70 in V1. | |
| A cross-encoder (`ms-marco-MiniLM-L-6-v2`, ~80MB) scores each | |
| (query, chunk) pair jointly, capturing semantic relevance that | |
| bi-encoder similarity misses. The tradeoff is ~100–200ms extra latency | |
| per query, acceptable given our 4.7s baseline is dominated by LLM | |
| generation, not retrieval. | |
| The reranker is enabled by default. Setting `rag.reranker.enabled: false` | |
| restores V1 behavior exactly. `reranker.top_k` is independent of | |
| `retrieval.top_k`, so the reranker's output count can be tuned without | |
| affecting the RRF candidate pool. | |
| The retriever passes all RRF-fused candidates to the reranker rather | |
| than a computed subset. The reranker's `top_k` handles truncation. | |
| This is simpler and more robust than computing an input size from | |
| per-system candidate counts. | |
| ## Why provider retry with exponential backoff | |
| OpenAI returns 429 (rate limit) errors under load. Without retry logic, | |
| a single 429 causes a user-visible failure. We add exponential backoff: | |
| attempt after 1s, 2s, 4s. After 3 retries, raise `ProviderRateLimitError` | |
| so the middleware returns a clear 503. | |
| The retry wraps the raw `openai.RateLimitError`: it must fire BEFORE | |
| the error gets translated to `ProviderRateLimitError`, otherwise retry | |
| logic is dead code. Other errors (400, 401, timeout) fail immediately. | |
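A sketch of the retry ordering described above. To stay runnable without the SDK, `RawRateLimitError` stands in for `openai.RateLimitError`; `call_with_backoff` and its parameters are hypothetical names.

```python
import time


class RawRateLimitError(Exception):
    """Stand-in for the SDK's raw rate-limit error (assumption)."""


class ProviderRateLimitError(Exception):
    """Translated error surfaced to the middleware."""


def call_with_backoff(call, delays=(1.0, 2.0, 4.0), sleep=time.sleep):
    """Retry on the raw rate-limit error, then translate after 3 retries."""
    for delay in delays:
        try:
            return call()
        except RawRateLimitError:
            sleep(delay)  # back off 1s, 2s, 4s before the next attempt
    try:
        return call()  # final attempt after the last delay
    except RawRateLimitError as exc:
        # Translation happens only once retries are exhausted.
        raise ProviderRateLimitError("rate limited after retries") from exc
```

Any other exception raised by `call` propagates immediately, matching the "fail immediately" rule for 400, 401, and timeouts.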
| ## Why in-memory API rate limiting | |
| A public-facing API needs abuse protection. We use a simple in-memory | |
| sliding window limiter: 10 requests/minute per IP. Sufficient for a | |
| demo deployment; a production system would use Redis. | |
| Known limitation: the per-IP dict grows without bound across distinct | |
| IPs. Acceptable for Fly.io with auto-stop (memory resets). If running | |
| continuously under bot traffic, add a periodic sweep or switch to a | |
| TTL-based structure. | |
| Design choices: | |
| - `/health` and `/metrics` exempt: monitoring should never be rate-limited. | |
| - `Retry-After` header: follows HTTP 429 spec, lets clients back off. | |
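A minimal sketch of an in-memory sliding-window limiter with the limits stated above. The class name, the injectable `clock`, and the return shape are assumptions for illustration.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit=10, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        # One deque of timestamps per IP; grows with distinct IPs.
        self._hits = defaultdict(deque)

    def check(self, ip):
        """Return (allowed, retry_after_seconds) for one request."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False, self.window_s - (now - hits[0])
        hits.append(now)
        return True, 0.0
```

The second element of a rejected result is the value a handler could round up and send in the `Retry-After` header.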
| ## Why SQLite for conversation persistence | |
| Three options considered: | |
| 1. In-memory dict: Lost on restart. | |
| 2. SQLite: Zero-dependency, file-based, survives restarts. | |
| 3. Redis/PostgreSQL: Adds infrastructure complexity. | |
| SQLite is right for this scale. `session_id` is optional: when omitted, | |
| the system behaves identically to V1 (stateless). This preserves backward | |
| compatibility and keeps benchmark evaluation deterministic. | |
| The route handler manages session state (load history, store Q+A), not | |
| the orchestrator. The orchestrator accepts an optional `history` parameter | |
| but has no knowledge of persistence. This keeps the agent loop testable | |
| without a database. | |
| Note: On HF Spaces, SQLite is ephemeral (no persistent storage on free | |
| tier). For the demo this is acceptable: sessions last until the container | |
| sleeps. Production would use a volume or managed database. | |
| ## Why a second provider (Anthropic) | |
| The provider abstraction existed since V1 but only had OpenAI + Mock. | |
| Adding Anthropic proves the abstraction works across fundamentally | |
| different APIs: | |
| - System message: `system=` parameter, not in the messages list | |
| - Tool definitions: `input_schema` instead of `parameters` | |
| - Tool results: `tool_result` content blocks in user messages | |
| - Tool calls: `tool_use` content blocks, not a separate field | |
| - Stop reason: `tool_use` vs `stop` | |
| The implementation is a config swap: `provider.default: anthropic` in | |
| YAML switches the entire system to Claude. The orchestrator, tools, | |
| evaluation harness, and serving layer are completely unchanged. | |
| Same retry/timeout handling as OpenAI. Both providers are tested with | |
| mocked HTTP responses, so no API keys are needed in CI. | |
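Two of the differences listed above, sketched as plain dict transformations. This assumes OpenAI Chat Completions style tool and message dicts as input; the helper names are hypothetical and not the provider's actual code.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Map an OpenAI function-tool definition to Anthropic's tool shape."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],  # same JSON Schema, different key
    }


def split_system_message(messages: list[dict]):
    """Pull system messages out of the list for the `system=` parameter."""
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return system, rest


search_tool = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search the docs",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
}
```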
| ## Why ranked_sources separate from deduplicated sources? | |
| The deduplicated `sources` list in `AgentResponse` is for the API | |
| response. The `ranked_sources` list preserves rank order with | |
| duplicates for evaluation metrics. P@5 and R@5 need the raw | |
| retrieval ranking, not the post-processed answer metadata. | |
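To make the distinction concrete, here is a sketch of how the two lists might be used. The metric definitions are the standard ones and the file names are made up; the evaluation harness may count hits differently.

```python
def precision_at_k(ranked_sources, relevant, k=5):
    """Fraction of the top-k ranked entries (duplicates kept) that are relevant."""
    top = ranked_sources[:k]
    return sum(1 for source in top if source in relevant) / k


def recall_at_k(ranked_sources, relevant, k=5):
    """Fraction of relevant files that appear anywhere in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_sources[:k]) & set(relevant)) / len(relevant)


def dedupe_preserving_order(ranked_sources):
    """What the API-facing `sources` list would look like."""
    return list(dict.fromkeys(ranked_sources))


ranked = ["a.md", "a.md", "b.md", "c.md", "d.md"]
relevant = {"a.md", "b.md"}
```

Computing P@5 on the deduplicated list would shift every entry's rank, which is why the raw ranking is kept separately.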
| ## Why vLLM over TGI / llama.cpp | |
| vLLM has the widest model support, best throughput via PagedAttention, and a native | |
| OpenAI-compatible server (`/v1/chat/completions`). TGI is a valid alternative; llama.cpp | |
| targets different use cases (edge/CPU inference). This is a deliberate choice, not | |
| ignorance of alternatives. | |
| ## Why Modal for GPU inference | |
| Serverless GPU eliminates idle cost and GPU node management. A10G at ~$1.30/hr costs | |
| ~$0.50 per full 27-question benchmark run. The Docker Compose path (`docker-compose.vllm.yml`) | |
| is retained for users who have local GPUs or prefer persistent serving. | |
| ## Why split topology (K8s API + Modal GPU) | |
| The API layer (retrieval, orchestration, tool routing) is CPU-bound and benefits from | |
| horizontal scaling via K8s HPA. The LLM inference layer is GPU-bound and benefits from | |
| serverless elasticity: Modal scales to zero when idle, scales up on demand with no node | |
| provisioning. Co-locating both in K8s would require GPU node pools with idle cost, | |
| node autoscaler latency, and NVIDIA device plugin management. This mirrors a common | |
| production pattern. | |
| ## Why Helm only, not Kustomize + Helm | |
| Showing two K8s deployment methods for the same app adds complexity without demonstrating | |
| distinct skills. Helm with `values-dev.yaml` / `values-prod.yaml` covers | |
| environment-specific configuration cleanly. | |
| ## Why CPU-based HPA, not custom metrics | |
| CPU utilization works without a Prometheus adapter or custom metrics server. A production | |
| improvement would use the Prometheus adapter to scale on p95 latency from the `/metrics` | |
| endpoint, which requires bridging the JSON metrics to Prometheus exposition format. | |
| Documented as a follow-up. | |
| ## Why env var fallback in SelfHostedProvider | |
| Follows the same pattern as OpenAIProvider reading `OPENAI_API_KEY`. The YAML config | |
| provides defaults; env vars override at runtime. No config loader changes needed. | |
| ## Why lazy tool-call detection, not metadata check | |
| Checking `/v1/models` metadata for tool-calling support is unreliable: model metadata | |
| doesn't consistently report this capability. Instead, the provider sends one tool-calling | |
| request on first `complete()` call with tools and checks if the response contains | |
| `tool_calls`. The result is cached as `self._supports_tool_calling`. Transient failures | |
| (timeout, 5xx) return `None` and retry on the next call rather than permanently | |
| downgrading to prompt-based fallback. | |
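The tri-state caching described above might look roughly like this. `ToolCallDetector` and the injected `probe` callable are hypothetical; in the real provider the probe is the first tool-calling request itself.

```python
class ToolCallDetector:
    def __init__(self, probe):
        # `probe` returns True/False, or raises on a transient failure.
        self._probe = probe
        self._supports_tool_calling = None  # None means "not yet known"

    def supports_tool_calling(self):
        if self._supports_tool_calling is None:
            try:
                self._supports_tool_calling = bool(self._probe())
            except (TimeoutError, ConnectionError):
                # Transient failure: leave the cache unset and retry later.
                return None
        return self._supports_tool_calling
```

Only a definite answer is cached, so one timeout cannot permanently push the provider onto the prompt-based fallback.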
| ## Why two-tier injection detection, not three | |
| The original design included a middle tier (embedding similarity against known injection examples). Dropped because the existing embedding model (all-MiniLM-L6-v2) is a general-purpose sentence encoder, not specialized for adversarial detection. Cosine similarity can't distinguish semantic similarity from intent similarity: "how do I ignore a field in Pydantic?" clusters near "ignore previous instructions" in that embedding space. The threshold between "ambiguous" and "suspicious" is an untunable hyperparameter with no ground truth. | |
| Two tiers are cleaner: heuristic regex is deterministic (matches or doesn't), DeBERTa classifier is probabilistic (confidence score). No ambiguous handoff between two probabilistic layers. Deployments without GPU get heuristic-only detection, which is documented, not hidden. | |
| ## Why regex + optional spaCy for PII, not a cloud API | |
| Three reasons: cost (cloud PII APIs charge per call), latency (adds network round-trip to every retrieved chunk), and data residency (PII leaves the system boundary). Regex covers the PII types with actual legal/compliance risk: SSNs, credit cards, emails, phone numbers, IP addresses. | |
| spaCy NER (PERSON, ORG) is optional because false-positive rates on technical text are unacceptable without domain tuning. "FastAPI" triggers ORG, "Jordan" triggers PERSON. The optional import pattern (`try: import spacy`) degrades gracefully with a logged warning, so there is no crash if someone sets `use_ner: true` without installing spaCy. | |
| ## Why append-only JSONL for audit, not SQLite | |
| One codepath, one format, no config branching. JSONL is append-only by nature: no schema migrations, no transactions, no connection pooling. Log rotation handles size. `jq` provides immediate queryability without building a custom API. | |
| The original design included an optional SQLite backend and a query endpoint (`GET /admin/audit`). Both were dropped: SQLite adds a second storage codepath with no consumer, and the query endpoint would require API key authentication, an inconsistency when `/ask` itself has no auth. | |
| JSONL imports trivially into SQLite/DuckDB if structured queries are needed later. No bridges burned. | |
| ## Why HMAC-SHA256 IP hashing in audit logs | |
| HMAC-SHA256 with a server secret hashes client IPs before logging. Plain SHA-256 was considered but rejected: the IPv4 address space (~4.3 billion) is small enough that unsalted hashes are reversible by offline enumeration. HMAC-SHA256 with a secret key makes precomputation infeasible without the key. The key is sourced from an explicit parameter, `AUDIT_HMAC_KEY` env var, or (with a logged warning) a random per-process fallback. | |
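A minimal sketch of the keyed hashing and the key-resolution order described above, using only the standard library. The function names are hypothetical; `AUDIT_HMAC_KEY` is the env var named in the text.

```python
import hashlib
import hmac
import logging
import os
import secrets

logger = logging.getLogger(__name__)


def resolve_key(explicit=None) -> bytes:
    """Explicit parameter, then AUDIT_HMAC_KEY, then a random fallback."""
    if explicit:
        return explicit
    env_value = os.environ.get("AUDIT_HMAC_KEY")
    if env_value:
        return env_value.encode("utf-8")
    # Hashes made with this key are not comparable across restarts.
    logger.warning("AUDIT_HMAC_KEY not set; using a random per-process key")
    return secrets.token_bytes(32)


def hash_ip(ip: str, key: bytes) -> str:
    """Keyed hash of a client IP; infeasible to enumerate without the key."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```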
| ## Why three output validators, not four | |
| The original design included a "length/format sanity check" (reject suspiciously short responses or raw JSON in natural-language context). Dropped because the calculator tool returns short numeric answers and the tech docs domain legitimately contains code blocks and JSON examples. Every false positive erodes trust in the validation layer. The three remaining checks (PII leakage, URL hallucination, blocklist) are deterministic with clear pass/fail semantics. | |
| ## Why buffer-then-validate for streaming output | |
| The `/ask/stream` endpoint buffers all events from the orchestrator before sending to the client, then validates the assembled answer. This means the client waits for the full answer before receiving any content chunks. The orchestrator emits the final synthesis as a single chunk (tool-use iterations are not streamed), so the buffering adds no perceptible latency. The alternative, streaming chunks immediately and appending a safety marker, leaks unsafe content to any client that stops reading after the `done` event. | |
| ## Why no authentication on API endpoints | |
| The HF Spaces demo is public by design: the `curl` examples in the README work without credentials, which is the point. Adding API key authentication would gate access but break the zero-friction demo experience that makes the project evaluable. | |
| The security pipeline protects *content* (injection detection, PII redaction, output validation), not *access*. This is a deliberate scope boundary: application-layer guardrails ensure the system behaves safely regardless of who calls it, rather than assuming trusted callers. Rate limiting (10 RPM per IP) provides basic abuse protection. | |
| A production deployment would add authentication (API keys or OAuth) at the infrastructure layer: reverse proxy, API gateway, or middleware. The security pipeline's `getattr(..., None)` pattern means auth can be layered on without modifying the existing security components. | |
| ## Why monitor mode for output validation, not gating? | |
| Output validation runs post-stream as a monitoring layer. The answer | |
| streams to the client, then validation runs and emits its verdict. Gating | |
| (buffer-then-validate) would add 4-5 seconds of dead air while the full | |
| answer generates, which is unacceptable streaming UX for a documentation Q&A bot. | |
| Trade-off: a hallucinated URL or PII fragment could reach the client | |
| before validation catches it. For this use case (FastAPI docs, no real | |
| PII in corpus), the risk is near-zero. The dashboard labels this | |
| "monitored" (not "gated") to be explicit about the posture. | |
| ## Why named residual risks and scope limits, not "fully mitigated" verdicts? | |
| The OWASP LLM Top 10 (2025) mapping could have been written as a 10-row table where LLM01 and LLM02 read as "addressed" without qualifiers, which would be shorter and cleaner-looking. Rejected because OWASP's own 2025 text is explicit about what an input guardrail can and cannot do, and writing a verdict that contradicts the source the mapping cites would be compliance theater. | |
| LLM01 Prompt Injection: OWASP 2025 states that RAG and fine-tuning do not fully mitigate prompt injection, and that indirect injection through retrieved content remains a core risk class. "Fully mitigated" is unsupportable for any system retrieving untrusted content into an LLM context window, which is every RAG system including this one. The LLM01 verdict reads "addressed directly with named residual risk"; the residual-risk cell cites OWASP's own "do not fully mitigate" language verbatim. | |
| LLM02 Sensitive Information Disclosure: OWASP 2025's LLM02 mitigations span four concern classes: access controls, training-data handling, user-consent transparency, and proprietary-information governance. This implementation addresses a narrower output-side subset (output validation for PII leakage, secret formats, and URL hallucination), which is not cleanly one of the four concern classes but a narrower scope than any of them. The verdict reads "addressed directly for the applicable scope"; the scope-limit cell enumerates the four concern classes verbatim and names what addressing the broader concerns would require (multi-tenant or authenticated architecture). | |
| The tension the entry resolves is honesty-vs-scannability: a mapping that surfaces named residual risks and scope limits is longer and harder to skim than one with uniform "addressed" verdicts, but the scannable version over-claims relative to the cited source. Honest evaluation is the brand. Every verdict cell in SECURITY.md must survive a reviewer reading OWASP 2025 in a second tab. | |
| See [SECURITY.md § LLM01 Prompt Injection](SECURITY.md#llm01-prompt-injection) and [§ LLM02 Sensitive Information Disclosure](SECURITY.md#llm02-sensitive-information-disclosure) for the verdict cells; this entry covers why the verdict discipline takes the form it does. The LLM01 "do not fully mitigate" phrasing and the LLM02 four-concern-class enumeration are canonical in SECURITY.md; the README tail and landing-page subtitle paraphrase but must preserve the named-residual-risk and scope-limit structure. | |
| ## Why additive SSE stage events? | |
| The enhanced `/ask/stream` adds `meta` and `stage` event types alongside | |
| the existing `sources`, `chunk`, and `done` events. Existing consumers | |
| that only handle the three legacy types are unaffected: they simply | |
| ignore events with unknown types. This avoids versioning the endpoint | |
| or breaking the non-streaming `/ask` contract. The `meta` event fires | |
| first (before any stages) so the frontend can display provider/model | |
| info immediately. | |
| ## Why vanilla JS for the frontend, not Alpine or React? | |
| The showcase dashboard has ~5 pieces of reactive state (pipeline stages, | |
| retrieval results, security badges, stats, chat messages). The SSE | |
| handler is inherently imperative: receive event, querySelector the | |
| target node, update classList and textContent. Wrapping this in a | |
| reactive framework adds a dependency, interview questions about | |
| "why is there a framework for 5 state variables", and indirection | |
| that fights the imperative SSE pattern. One `state` object + a few | |
| `render()` functions handles it in ~150 lines. | |
| ## Phase 1 SSE gate closure: two baselines on record, not one | |
| The Phase 1 acceptance gate for the SSE backend work (meta event, | |
| stage events, iteration-aware metadata threading, route-level | |
| injection/output-validation events) requires re-running | |
| `make evaluate-fast` and confirming numbers match pre-change state | |
| on the pinned `gpt-4o-mini-2024-07-18` snapshot. The re-run was | |
| honored literally rather than substituted with a git-diff | |
| argument, even though the SSE commits did not touch | |
| `scripts/evaluate.py`'s legacy code path. Two reasons: the | |
| re-commitment discipline that kept Fix 1 and Fix 2 honest applies | |
| equally here, and the legacy path and the `--corpus fastapi` path | |
| produce materially different baselines that cannot substitute for | |
| each other. | |
| **Two distinct baselines now exist at the pinned snapshot, and | |
| both are on record**, one per prompt path: | |
| | Baseline file | Invocation | Prompt source | In-scope P@5 | In-scope R@5 | Citation | Mean calls | | |
| |---|---|---|---|---|---|---| | |
| | `results/fastapi_preedit.json` @ `213da36` | `--corpus fastapi` | `format_system_prompt("FastAPI")` | 0.718 | 0.833 | 1.000 | 1.14 | | |
| | `results/fastapi_legacy_baseline_pinned.json` @ this commit | `make evaluate-fast` (no `--corpus`) | `tech_docs.yaml` `task.system_prompt` | 0.655 | 0.849 | 1.000 | 1.45 | | |
| Citation accuracy holds at 1.000 on both paths, both in-scope and | |
| out-of-scope. The retrieval metric deltas (P@5 −0.063, R@5 +0.016, | |
| keyword hit rate +0.045) and behavioral delta (mean tool calls +0.318 in-scope, | |
| +1.00 out-of-scope) trace to the prompt-path divergence | |
| (`scripts/evaluate.py:67` reads `task.system_prompt` in the legacy | |
| branch vs. `format_system_prompt(label)` in the `--corpus` branch), | |
| not to any change in retrieval, reranking, or refusal-gate code. | |
| This divergence is the same one the "evaluation-layer multi-corpus | |
| support lagged the serving-layer refactor" entry documents; the | |
| narrowed serving-migration deferral tracks its eventual migration. | |
| **Why both baselines are retained.** When the serving-migration | |
| deferral lands and `scripts/evaluate.py`'s legacy branch is removed | |
| (everything routes through `--corpus fastapi`), the regression gate | |
| is "post-migration `make evaluate-fast` output matches pre-migration | |
| `--corpus fastapi` output within pre-committed tolerances." That | |
| gate requires the `--corpus fastapi` baseline as the comparison | |
| reference AND the legacy baseline as evidence of the pre-migration | |
| state that is being retired. Retaining both makes the migration | |
| auditable and bounds its regression budget; retaining only one | |
| would force the post-migration run to compare against a baseline | |
| from a different prompt path, guaranteeing the gate fires on | |
| prompt divergence rather than on any actual regression. | |
| **Gate verdict: passed.** No regression vs pre-SSE legacy path | |
| expectations (citation 1.000 holds, refusal gate fires on the same | |
| 5 out-of-scope questions, retrieval numbers in sane in-scope | |
| ranges). Phase 1 SSE backend work is closed from the backend side; | |
| the frontend's consumption of iteration-aware stage events is | |
| orthogonal and owned by Week 1 step 7 (showcase UI). | |
| ## Why per-corpus refusal thresholds? | |
| FastAPI and Kubernetes have different corpus characteristics. FastAPI | |
| has 16 short, well-structured docs with sparse cross-references; | |
| relevance tends to concentrate in 1-2 chunks per query. Kubernetes | |
| has 30-40 docs with heavy cross-referencing between concepts (Pod → | |
| Deployment → Service → Ingress), which spreads relevance across more | |
| chunks. A single global refusal threshold would either refuse too | |
| aggressively on K8s (no single chunk dominates, so the top score | |
| looks "low") or not aggressively enough on FastAPI (where a | |
| moderate-scoring chunk might be the only hit and should still refuse). | |
| `CorpusConfig` carries `refusal_threshold` as a per-corpus field. | |
| Each threshold gets tuned against its own golden dataset; there | |
| is no "fair" shared threshold because, following BEIR's framing, retrieval | |
| scores are not comparable across corpora. Placeholder values ship in `default.yaml` | |
| and are replaced by tuned values during the per-corpus evaluation | |
| sweep. | |
| ## Why corpus and provider toggles compose: corpus_map[corpus][provider] | |
| The simpler design would have been `corpus_map[corpus]` returning a | |
| single orchestrator. It ships in 10 fewer lines. It also silently | |
| breaks the provider toggle in multi-corpus mode: the orchestrator | |
| inside each corpus cell holds one fixed provider, and clicking | |
| "Anthropic" in the dashboard keeps running on OpenAI. | |
| This project's hero-tile metric is the provider comparison (`1.00 API / | |
| 0.14 7B self-hosted`). Breaking the mechanism that demonstrates that | |
| metric, on a portfolio demo where a reviewer will open DevTools and | |
| notice, would erode the honest-evaluation brand the whole repo is | |
| built around. The nested `corpus_map[corpus][provider]` structure | |
| keeps both toggles functional. Store, retriever, and search tool are | |
| shared across providers within a corpus (the expensive objects are | |
| held once per corpus); only the orchestrator varies per provider | |
| since it holds the LLM client. Per-corpus × per-provider memory | |
| overhead is an orchestrator struct, not a FAISS index. | |
| RSS is logged per corpus, not per corpus × provider, because the | |
| store is what drives memory. The provider multiplier is negligible | |
| compared to a hybrid index + embedder. | |
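The sharing pattern can be sketched as below. `SearchTool` and `Orchestrator` here are stripped-down stand-ins, and `build_corpus_map` is a hypothetical helper; the point is only which objects are built once per corpus and which once per (corpus, provider) cell.

```python
from dataclasses import dataclass


@dataclass
class SearchTool:
    corpus: str  # stands in for the store + retriever + index


@dataclass
class Orchestrator:
    provider: str
    tool: SearchTool


def build_corpus_map(corpora, providers):
    corpus_map = {}
    for corpus in corpora:
        tool = SearchTool(corpus)  # expensive object, built once per corpus
        corpus_map[corpus] = {
            provider: Orchestrator(provider, tool)  # cheap, one per provider
            for provider in providers
        }
    return corpus_map


corpus_map = build_corpus_map(["fastapi", "k8s"], ["openai", "anthropic"])
```

A request handler would then pick `corpus_map[corpus][provider]`, so both toggles stay live.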
| ## Why one parameterized system prompt, not per-corpus templates | |
| The template is `"You are a technical documentation assistant for | |
| {corpus_label}..."`. The only corpus-specific element is the label; | |
| prompt content is identical across corpora: same citation format, | |
| same refusal language, same grounding instructions. Having two | |
| separate prompt files would invite drift: someone tweaks the FastAPI | |
| prompt for a specific failure mode and forgets to update the K8s | |
| version, and the demo silently answers differently on the two toggles. | |
| The parameterization is enforced by two tests: (a) | |
| `format_system_prompt("")` raises `ValueError` so an unresolved | |
| `{corpus_label}` can never reach the LLM, and (b) a spy on | |
| `orchestrator.run_stream` asserts FastAPI and K8s requests receive | |
| different prompts with the correct label substituted. | |
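A sketch of the guard that test (a) exercises. The template string is shortened to its first sentence because the rest is elided in this document; the validation details are assumptions.

```python
# Shortened stand-in; the full template text is elided in this document.
SYSTEM_PROMPT_TEMPLATE = (
    "You are a technical documentation assistant for {corpus_label}."
)


def format_system_prompt(corpus_label: str) -> str:
    """Fill in the corpus label, refusing empty or blank labels."""
    if not corpus_label or not corpus_label.strip():
        raise ValueError("corpus_label must be a non-empty string")
    return SYSTEM_PROMPT_TEMPLATE.format(corpus_label=corpus_label.strip())
```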
| The wording deliberately differs from the typical "don't hallucinate" | |
| RAG template: | |
| - **"refuse the question explicitly"** matches our refusal-gate | |
| mechanism. "Say so politely" is soft language that models interpret | |
| as "hedge and answer anyway". | |
| - **"do not infer, do not extrapolate, do not draw on general | |
| knowledge"** is the three-verb prohibition. "Do not fabricate" is | |
| empirically easier to slip past because models distinguish | |
| fabrication (making things up) from extrapolation (drawing | |
| conclusions from adjacent but non-authoritative context). | |
| ## Why Kubernetes curation targets recruiter-likely questions, not coverage | |
| The K8s corpus targets ~30-40 pages curated around concepts a | |
| technical reviewer would naturally type (Pod, Deployment, Service, | |
| Ingress, ConfigMap, RBAC) plus cross-referencing overview pages that | |
| stress the reranker. Cluster administration deep-dives, tutorials, | |
| and kubectl reference are explicitly excluded; they add noise without | |
| adding reviewer value and hurt retrieval precision when adjacent | |
| content is thin on concept definitions. | |
| `data/k8s_docs/SOURCES.md` is a version-controlled curation artifact. | |
| Each ingested URL has a one-line rationale, a date pulled, and a | |
| license note. This makes the corpus reproducible and documents the | |
| curation reasoning for any reviewer who looks closely. | |
| Trade-off: the corpus is not comprehensive K8s knowledge. A question | |
| about etcd raft internals will be correctly refused. This is not a | |
| bug: the refusal is part of the demo story, and "the system knows | |
| what it doesn't know" is a feature of the grounded-refusal mechanism. | |
| ## Why no cross-corpus score comparison (inspired by BEIR) | |
| Following BEIR's heterogeneous-benchmark framing (Thakur et al., | |
| NeurIPS 2021), which spans 18 datasets across 9 task types, this repo | |
| does not treat absolute retrieval scores as comparable across the FastAPI | |
| and K8s corpora: score distributions depend on chunk length, vocabulary | |
| overlap, and corpus density, none of which are held constant across | |
| domains. Only rank-ordering of system configurations within a single | |
| corpus is meaningful. Concrete consequences for this repo: | |
| - Per-corpus evaluation results are reported separately, never | |
| aggregated into a single "combined" number. | |
| - The hero-tile citation accuracy (`1.00 API / 0.14 7B self-hosted`) | |
| stays FastAPI-specific. It is not restated as a cross-corpus average. | |
| - `make evaluate-fast` accepts a `--corpus` flag but has no "combined" | |
| mode. Anyone who wants a cross-corpus number has to run twice and | |
| acknowledge the incomparability in prose. | |
| - The landing page "Key Findings" cards avoid sentences that compare | |
| FastAPI and K8s numbers directly. | |
| The multi-corpus demo is a **surface feature for interactive | |
| exploration**, not a rebenchmark. The benchmark section of the README | |
| remains FastAPI-only and cites 27 questions on 16 docs with specific | |
| chunker settings. | |
| ## K8s golden dataset uses CRAG's 8-type taxonomy as the schema | |
| The K8s golden dataset uses CRAG's 8-type taxonomy (Yang et al., | |
| NeurIPS 2024) **as the schema** for `question_type`, not as a | |
| requirement to cover all 8 types. CRAG's taxonomy: `simple`, | |
| `simple_w_condition`, `set`, `comparison`, `aggregation`, | |
| `multi_hop`, `post_processing_heavy`, `false_premise`. Temporal | |
| dynamism is a separate orthogonal property captured as | |
| `time_sensitive: bool` on the question schema; it is not a CRAG | |
| category. | |
| Target distribution across the 25-question K8s golden set: | |
| - `simple` (5–6): baseline retrieval | |
| - `simple_w_condition` (3–4): nuanced understanding under conditions | |
| - `comparison` (3–4): retrieval across concept pages, reranker stress | |
| - `multi_hop` (5–6): synthesis across 2–4 docs, reranker stress | |
| - `false_premise` (3–4): grounded refusal mechanism | |
| - `set` / `aggregation` / `post_processing_heavy` (0–3): included | |
| only where corpus content naturally supports | |
| `time_sensitive: bool` flags 2–3 questions targeting version-bounded | |
| content (feature state, deprecations, API version migration). | |
| `false_premise` questions come in two flavors (see separate | |
| "False-premise questions come in two flavors" entry): pure refusal | |
| (flavor A) and documented negative (flavor B). The K8s set includes | |
| at least one of each. Flavor A tests the path where retrieval | |
| correctly returns nothing useful; flavor B tests the path where the | |
| corpus contains an explicit negative answer and the agent must | |
| surface it with citation rather than confabulating a positive. | |
| Rationale for using CRAG as schema (not coverage requirement): | |
| `false_premise` and `time_sensitive` stress grounded refusal and | |
| reduce test-set contamination risk; `multi_hop` and `comparison` | |
| stress the reranker because relevance spreads across multiple | |
| chunks. The distribution was chosen to exercise the parts of the | |
| pipeline the benchmark story claims, not to mimic a general-purpose | |
| QA benchmark. | |
| The golden dataset JSON schema (v2, backward-compatible with the | |
| FastAPI flat list) includes: | |
| - `source_chunk_ids: list[str]` for multi-hop partial credit | |
| (answer must cite at least one of the expected chunks) | |
| - `source_snippets: list[str]` for human-readable context during | |
| review | |
| - `question_type: str` (CRAG taxonomy value) | |
| - `is_multi_hop: bool` for filtered reporting | |
| - Dataset-level header with `corpus`, `version`, `snapshot_date`, | |
| and pinned `chunker` parameters so the dataset is reproducible | |
| against a specific K8s docs snapshot | |
| See `docs/plans/2026-04-12-multi-corpus-refactor-design.md` for the | |
| full schema and rationale. | |
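The schema fields listed above can be sketched as plain dataclasses. This is a minimal illustration, not the repo's actual model: the `question` field name, the `"questions"` key, and the `parse_golden` and `cites_expected_chunk` helpers are assumptions made here to show how a v1 flat list and a v2 object with a header could share one loader.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# CRAG's eight question types, as listed in this entry.
CRAG_TYPES = (
    "simple", "simple_w_condition", "set", "comparison",
    "aggregation", "multi_hop", "post_processing_heavy", "false_premise",
)


@dataclass
class GoldenQuestion:
    question: str  # hypothetical field name for the question text
    question_type: str = "simple"
    is_multi_hop: bool = False
    time_sensitive: bool = False
    source_chunk_ids: list[str] = field(default_factory=list)
    source_snippets: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.question_type not in CRAG_TYPES:
            raise ValueError(f"unknown question_type: {self.question_type!r}")


@dataclass
class GoldenDataset:
    questions: list[GoldenQuestion]
    corpus: str | None = None
    version: str | None = None
    snapshot_date: str | None = None
    chunker: dict = field(default_factory=dict)  # pinned chunker parameters


def parse_golden(payload: list | dict) -> GoldenDataset:
    """Accept a v1 flat list or a v2 object with a dataset-level header."""
    if isinstance(payload, list):  # v1: bare list of question records
        return GoldenDataset(questions=[GoldenQuestion(**q) for q in payload])
    header = {k: v for k, v in payload.items() if k != "questions"}
    questions = [GoldenQuestion(**q) for q in payload["questions"]]
    return GoldenDataset(questions=questions, **header)


def cites_expected_chunk(cited_ids: list[str], q: GoldenQuestion) -> bool:
    """Multi-hop partial credit: at least one expected chunk is cited."""
    return bool(set(cited_ids) & set(q.source_chunk_ids))
```

Keeping the header fields optional is what lets the flat FastAPI list load unchanged while the K8s file carries its snapshot metadata.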
| ## EU AI Act corpus deferred to v1.2 | |
| EU AI Act compliance mapping is deferred to v1.2. Rationale: v1 | |
| ships two corpora (FastAPI, K8s) to demonstrate the multi-corpus | |
| architecture; EU AI Act as a third corpus would add ingestion and | |
| golden-set work without exercising architecturally new surface. | |
| Scoped as the first v1.2 addition after v1 launch. | |
| ## Cold-start contingency: measure first, lazy-load if needed | |
| Loading two corpora at startup costs memory and cold-start time. On | |
| HF Spaces (target deployment), the realistic ceiling is 8-10 GB | |
| resident RAM and ~60 seconds cold-start before the demo feels broken. | |
| **Policy:** | |
| 1. Measure HF Spaces cold-start on Day 1 of deployment. | |
| 2. If cold-start < 60 s: plan validated, no changes. | |
| 3. If cold-start > 60 s: implement a lazy-load path (FastAPI eager, | |
| K8s lazy on first K8s request). Scoped ~2 hours implementation. | |
| This contingency is **not** pre-built. Pre-building a lazy-load path | |
| that may never ship creates dead code that rots, and the test surface | |
| for "lazy loading plus corpus routing plus provider switching" is | |
| non-trivial. The RSS logging in `app.py` (Task 2) emits the exact | |
| numbers needed to make the decision; the decision is documented here | |
| so future-me remembers the threshold and doesn't optimize prematurely | |
| on a hunch. | |
| ## Cold-start gate fired: assumption falsified, fix deferred to v1.1 at the right cause | |
| The preceding "Cold-start contingency" entry pre-committed a lazy-load | |
| fix (FastAPI eager, K8s lazy on first request) if the measured cold | |
| start exceeded 60 seconds. Measurement falsified the entry's core | |
| assumption: **corpus loading is not the dominant cold-start cost**. | |
| The committed fix addresses ~1 % of the observed overshoot. Executing | |
| it verbatim would honor the gate's letter but not its intent: theater | |
| dressed as discipline. This entry documents the measurement, the | |
| falsified assumption, and the new contingency pre-committed at the | |
| actual cause. | |
| **Measurement (N=3, 2026-04-15, HF Spaces target deployment):** | |
| | Sample | Cold start | Silent Python init | Visible phase | | |
| |---|---|---|---| | |
| | N=1 | 113 s | ~101 s | ~12 s | | |
| | N=2 | 89 s | ~70 s | ~19 s | | |
| | N=3 | 129 s | ~115 s | ~14 s | | |
| - Median 113 s, mean ~110 s, range 89–129 s (spread ~40 s) | |
| - **Gate fire is unambiguous at both tails.** Even the fastest sample | |
| (89 s) is ~48 % over the 60 s threshold; the slowest (129 s) is | |
| ~115 % over. No boundary ambiguity. | |
| - **Sample-size justification.** N=3 is acknowledged as a small sample. | |
| It is adequate here because (a) the gate-fire conclusion is stable | |
| across both tails, (b) the "silent Python init dominates variance" | |
| finding is stable across all three samples (silent phase varies | |
| 70–115 s across runs; visible phase varies only 12–19 s), and | |
| (c) the cost of additional samples (manual HF Space restart + ~2 min | |
| wait + log extraction per sample) exceeds the marginal information | |
| gain once both tails fire the gate and the variance pattern is stable. | |
| N=4 would tighten the confidence interval on the median but does not | |
| change either the gate-fire conclusion or the falsified-assumption | |
| finding. | |
| - **Variance source named.** HF Spaces shared-infrastructure CPU / IO | |
| contention during Python module imports. The silent-init phase | |
| varies 45 s across samples (70–115 s); the visible phase is stable | |
| (12β19 s). That is the signature of host-level contention on a | |
| shared physical node, not code-level variability. An | |
| exclusively-owned container would plausibly show a tighter bound. | |
| - **Raw log captures** (preserved so this entry can be cross-checked | |
| against the underlying evidence without re-running the measurement): | |
| `measurements/2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log`. | |
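The summary figures in the bullets above follow directly from the three table rows. A short sketch of the arithmetic, using only the numbers in the table and the 60 s gate (the variable and function names are illustrative):

```python
from statistics import mean, median

GATE_S = 60  # pre-committed cold-start threshold, in seconds

# (silent Python init, visible phase) in seconds, copied from the table above
samples = [(101, 12), (70, 19), (115, 14)]
totals = [silent + visible for silent, visible in samples]


def overshoot_pct(total_s: float, gate_s: float = GATE_S) -> float:
    """Percentage by which a cold start exceeds the gate."""
    return 100 * (total_s - gate_s) / gate_s


silent_spread = max(s for s, _ in samples) - min(s for s, _ in samples)
visible_spread = max(v for _, v in samples) - min(v for _, v in samples)

print(totals, median(totals), round(mean(totals), 1))
print(round(overshoot_pct(min(totals))), round(overshoot_pct(max(totals))))
print(silent_spread, visible_spread)
```

The silent-phase spread (45 s) dwarfs the visible-phase spread (7 s), which is the variance pattern the bullets attribute to host-level contention.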
| **Where the cost lives.** At the median (113 s): | |
| - **Silent Python init phase, ~90 s (≈ 80 % of total):** interpreter | |
| start, module imports (`torch`, `transformers`, `langchain`, `faiss`, | |
| `fastapi`, `httpx`, the full dependency closure), and initial model | |
| weight loading (`all-MiniLM-L6-v2` embedder, cross-encoder | |
| reranker). Not logged β no observability inside the import chain. | |
| - **Visible startup phase, ~15 s (≈ 15 % of total):** injection | |
| classifier init (~10 s, includes the "classifier skipped" warning), | |
| FastAPI corpus load (< 1 s, +0.9 MB RSS), K8s corpus load (< 1 s, | |
| +25.8 MB RSS), reranker warmup (~2 s). | |
| **The K8s corpus load, which the pre-committed fix was designed to | |
| defer, contributes under 1 second of the 113-second median.** | |
| Deferring it saves roughly 1 % of the overshoot. FastAPI corpus load | |
| is the same order of magnitude. Corpus loading is simply not where the | |
| cost lives on this deployment. | |
| **Why we are not executing the pre-committed fix.** The preceding | |
| contingency was written under an empirical assumption about cost | |
| attribution (corpus loading is the dominant cost). Measurement | |
| falsified the assumption. Implementing the fix anyway would be a | |
| mechanical execution of a recipe whose premise has been disproven: | |
| it checks the gate-honoring box while failing to address the cause. | |
| That is structurally identical to relaxing-by-redefinition ("60 s was | |
| too tight"), just in the opposite direction: **relaxing by execution**. | |
| The pre-commitment rule's purpose is to prevent motivated reasoning | |
| about the gate, not to mandate mechanical compliance with a recipe | |
| whose empirical foundation has collapsed. | |
| The honest action is (1) accept the measurement as the v1 baseline, | |
| (2) document the falsified assumption explicitly (this entry), | |
| (3) re-pre-commit a new contingency at the actual dominant cost with | |
| an explicit trigger condition so the decision is not relitigated at | |
| review time, and (4) update the user-facing README surface to reflect | |
| the measured cold-wake number rather than the optimistic pre-deploy | |
| estimate. | |
| **v1.1 contingency β pre-committed:** | |
| > **If HF Spaces traffic produces more than N cold wakes per day** | |
| > (N to be determined from observed usage patterns after launch, **not | |
| > estimated in advance**), defer eager loading of (a) the cross-encoder | |
| > reranker, (b) the sentence-transformers embedder, and (c) the | |
| > injection classifier tier to first-relevant-request. | |
| > | |
| > **Estimated work:** 4–6 hours (lazy-init wrappers + first-request | |
| > caching + integration tests for the warm/cold transition). | |
| > | |
| > **Expected tradeoff:** cold wake ~113 s → ~50–60 s (approaches the | |
| > original 60 s target); **first request after any cold wake incurs | |
| > +8–15 s** additional latency (model weights load synchronously in | |
| > the request path), after which subsequent warm requests return to | |
| > normal ~5 s latency. | |
| > | |
| > **Trigger is usage-justified, not estimate-justified.** Until real | |
| > traffic data justifies the work, there is nothing to optimize: a | |
| > recruiter demo that gets one cold wake per day does not pay for | |
| > 4–6 hours of engineering plus the new first-request-latency failure | |
| > mode. The trigger threshold N is left unnamed deliberately: naming a | |
| > number in advance would invite the same falsification pattern this | |
| > entry is documenting. | |
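For orientation, the "lazy-init wrapper + first-request caching" named in the contingency could look roughly like the sketch below. It is not the planned implementation: the `Lazy` class is a hypothetical helper, and the factory passed to it stands in for whatever loads the reranker, embedder, or classifier.

```python
from __future__ import annotations

import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """Defer an expensive constructor until first use, then cache the result."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._value: T | None = None
        self._loaded = False

    def get(self) -> T:
        if not self._loaded:          # fast path once warm
            with self._lock:          # concurrent first requests load once
                if not self._loaded:
                    self._value = self._factory()
                    self._loaded = True
        return self._value  # type: ignore[return-value]
```

The first caller pays the load synchronously inside `get()`, which is exactly the first-request latency cost the tradeoff paragraph describes; later callers take the unlocked fast path.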
| **Methodology lesson.** When a pre-committed contingency is written | |
| under an empirical assumption, the contingency only holds if the | |
| assumption survives measurement. If measurement falsifies the | |
| assumption, the correct action is to document the falsification, | |
| accept the observed baseline, and re-pre-commit at the actual cause. | |
| The wrong action is to execute the original recipe anyway, which | |
| trades one form of motivated reasoning (threshold relaxation) for | |
| another (recipe compliance). The underlying discipline ("pre-commit | |
| your gates and honor them") does not mean "mechanically run the | |
| pre-committed fix regardless of what it addresses." It means "honor | |
| the gate's *intent*, which is to prevent motivated reasoning about | |
| pass/fail." | |
| **Post-hoc refinement (2026-04-22): three latency regimes observed.** | |
| Follow-up warm-latency measurements after the initial cold-start | |
| characterization distinguish a middle regime that the README's | |
| "~2 min cold / ~5 s warm" phrasing collapses: | |
| | Regime | Latency | Cause | | |
| |---|---|---| | |
| | Cold-start (container spin-up) | 89–129 s | Python imports, model load (per the table above) | | |
| | Wake-from-idle (first `/ask` after container is up) | ~6.7 s (≈ +2 s over warm) | Residual lazy init on first request | | |
| | Steady-state warm | ~5 s (mean 4.74 s, n=5) | LLM API round-trip + retrieval | | |
| Source: n=1 first-hit-after-cold-wake @ 6.7 s, then n=5 steady-state | |
| warm @ 4.07, 4.69, 4.78, 5.06, 5.11 s (2026-04-22). The README claim | |
| captures regimes 1 and 3; the ~2 s first-hit penalty between them is | |
| a refinement of the warm-cold boundary, not a contradiction. The | |
| v1.1 projection of "+8–15 s first request after any cold wake" | |
| should be read against this ~2 s current baseline rather than a | |
| zero-penalty starting point, so the true v1.1 UX cost over current | |
| behavior is closer to +6–13 s. | |
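The regime arithmetic above can be reproduced from the quoted samples. A small worked sketch (variable names are illustrative; the numbers are the ones cited in the Source line):

```python
from statistics import mean

warm_s = [4.07, 4.69, 4.78, 5.06, 5.11]  # steady-state warm samples, n=5
first_hit_s = 6.7                         # first /ask after a cold wake, n=1

warm_mean = mean(warm_s)
first_hit_penalty = first_hit_s - warm_mean  # current wake-from-idle cost

projected_v11 = (8, 15)  # projected first-request penalty under lazy loading
net_over_current = tuple(p - round(first_hit_penalty) for p in projected_v11)

print(round(warm_mean, 2), round(first_hit_penalty, 1), net_over_current)
```

With n=1 for the first-hit sample, the ~2 s penalty is a point estimate, so the net range should be treated as approximate.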
| ## False-premise questions come in two flavors | |
| When authoring golden-dataset questions whose premise is wrong, the | |
| question can point at one of two genuinely different failure modes. | |
| Both are valid; they test different pipeline paths and should be | |
| labeled distinctly so the evaluator routes correctly. | |
| **Flavor A: pure refusal.** The premise is not addressed anywhere in | |
| the corpus. Example: "How do I configure Claude API rate limits in | |
| Kubernetes?" K8s has no such concept. Schema: `category: "out_of_scope"`, | |
| `expected_sources: []`, `source_snippets: []`. The evaluator's | |
| `grounded_refusal` metric expects the answer to contain a refusal | |
| phrase ("does not contain", "no information") AND cite zero sources. | |
| Tests the pipeline path where retrieval correctly returns nothing | |
| useful and the agent correctly declines. | |
| **Flavor B: documented negative.** The corpus contains an explicit | |
| negative answer. Example: "How do I configure NetworkPolicy to enforce | |
| mTLS?" The K8s NetworkPolicy docs have a "What you can't do with | |
| network policies" section that explicitly says "Anything TLS related | |
| (use a service mesh or ingress controller for this)". Schema: | |
| `category: "retrieval"`, `question_type: "false_premise"`, | |
| `expected_sources: [<the negative-answer page>]`, `source_snippets: | |
| [<the verbatim negative statement>]`. The evaluator expects the agent | |
| to retrieve the page, find the negative statement, and answer | |
| negatively with a citation. Tests the stricter path where the corpus | |
| genuinely contains the answer and the agent must not hallucinate a | |
| contradictory capability. | |
| **Why both matter for the honest-evaluation brand.** Grounded refusal | |
| is not "refuse when retrieval is weak." It is "answer exactly what the | |
| source says, including when the source says no." Flavor A tests the | |
| first half (refuse when there is nothing to ground on); flavor B tests | |
| the second half (report the documented negative instead of | |
| confabulating a positive). The K8s golden dataset includes at least | |
| one of each. The first K8s pilot (`k8s_pilot_005`, NetworkPolicy | |
| mTLS) is flavor B. Flavor A is reserved for questions targeting | |
| features that genuinely do not exist in the K8s corpus; at least one | |
| such question is required in the full 25-question set. | |
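To make the two labelings concrete, here is a hedged sketch of one record per flavor plus a check in the spirit of the `grounded_refusal` metric described above. The `question` key, the `flavor` router, and the exact matching logic are assumptions for illustration; only the schema fields, example questions, and refusal phrases come from this entry.

```python
REFUSAL_PHRASES = ("does not contain", "no information")

flavor_a = {
    "question": "How do I configure Claude API rate limits in Kubernetes?",
    "category": "out_of_scope",
    "expected_sources": [],
    "source_snippets": [],
}

flavor_b = {
    "question": "How do I configure NetworkPolicy to enforce mTLS?",
    "category": "retrieval",
    "question_type": "false_premise",
    "expected_sources": ["k8s_network_policies.md"],
    "source_snippets": [
        "Anything TLS related (use a service mesh or ingress controller for this)"
    ],
}


def flavor(record: dict) -> str:
    """Route a false-premise record by its labels: 'A' or 'B'."""
    if record["category"] == "out_of_scope" and not record["expected_sources"]:
        return "A"
    return "B"


def grounded_refusal(answer: str, cited_sources: list[str]) -> bool:
    """Flavor A passes only if the answer refuses AND cites zero sources."""
    refused = any(phrase in answer.lower() for phrase in REFUSAL_PHRASES)
    return refused and not cited_sources
```

A flavor-B answer would fail `grounded_refusal` by design, since it must cite the negative-answer page; that is why the two flavors need distinct labels for the evaluator to route on.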
| ## Pilot_005 refusal-gate + agent-behavior measurement | |
| The first K8s pilot run surfaced two distinct flavor-B failure modes | |
| on `k8s_pilot_005` (NetworkPolicy mTLS). Both are empirical, both | |
| have specific numbers, and both are logged in | |
| `results/k8s_pilot_threshold_0.02.json` and | |
| `results/k8s_pilot_threshold_0.015.json`. | |
| **Failure mode 1: threshold calibration (at 0.02).** The | |
| `SearchTool.execute()` refusal gate fired with `max_score=0.01639`, | |
| exactly `1/(60+1)`, the rank-1 RRF score from a single fusion system. | |
| BM25 hit "NetworkPolicy" at rank 1; the dense encoder contributed | |
| nothing, because "Anything TLS related (use a service mesh or ingress | |
| controller for this)" is a single negative sentence, not a conceptual | |
| topic the page is semantically "about." Hybrid fusion inherited only | |
| the BM25 rank-1 score. At threshold 0.02 (the FastAPI working value), | |
| the gate refused before the agent saw any chunks. Retrieval P@5 and | |
| R@5 both 0.00; answer is a generic refusal. | |
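The `1/(60+1)` figure falls out of the RRF formula when only one retriever ranks the chunk. A minimal sketch, assuming the gate refuses when the best fused score is strictly below the threshold (the comparison direction, the chunk IDs, and both function names are illustrative, not taken from `SearchTool`):

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: sum 1/(k + rank) over every ranking that
    contains the document, with ranks starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def gate_refuses(scores: dict[str, float], threshold: float) -> bool:
    """Assumed gate: refuse when the best fused score is below threshold."""
    return max(scores.values(), default=0.0) < threshold


# BM25 ranks the NetworkPolicy chunk first; the dense ranking never lists it.
bm25 = ["np_tls", "np_intro"]
dense = ["pods_overview", "svc_overview"]
scores = rrf_scores([bm25, dense])

print(round(max(scores.values()), 5))
print(gate_refuses(scores, 0.02), gate_refuses(scores, 0.015))
```

Had both retrievers ranked the chunk first, its score would have been `2/61`, comfortably above 0.02, which is why a threshold tuned on FastAPI implicitly assumed agreement between the two retrievers.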
| **Failure mode 2: agent behavior on documented negative (at 0.015).** | |
| With the threshold dropped just below the measured max score | |
| (`0.015 < 0.01639`), retrieval is perfect: P@5 1.00, R@5 1.00, all | |
| five top chunks from `k8s_network_policies.md`. But the agent still | |
| produces a flavor-A-style refusal: *"The Kubernetes documentation | |
| does not provide specific instructions on configuring a NetworkPolicy | |
| to enforce mutual TLS..."* The "Anything TLS related" sentence is in | |
| the retrieved chunks; the agent simply treats the absence of | |
| positive instructions as grounds for refusal, rather than reading the | |
| explicit negative sentence and citing it as the answer. KHR 0.67: the | |
| `service mesh` and `ingress controller` keywords (the documented | |
| alternatives the page points to) are missing from the answer. | |
| **Implication.** The flavor-B mechanism requires more than threshold | |
| tuning. Fixing the gate is necessary but not sufficient. The system | |
| prompt needs a flavor-B clause (e.g., *"if the documentation | |
| explicitly says a feature does not exist or is not supported, report | |
| that with citation; do not treat it as unanswerable"*), **or** the | |
| K8s golden dataset's flavor-B questions must use phrasing the | |
| current prompt can route correctly. The 0.30 placeholder value from | |
| the design doc was based on "prefer conservative" intuition without | |
| empirical grounding; the measured working range for K8s pilot | |
| retrieval is more than an order of magnitude lower than that | |
| intuition, and even at the working threshold the prompt layer is the | |
| blocker. | |
| **What this measurement is.** A pilot smoke-test result, not a | |
| benchmark claim. Aggregates at 0.02: P@5 0.63, R@5 0.83, KHR 0.69. | |
| Aggregates at 0.015: P@5 0.80, R@5 1.00, KHR 0.75. Five of six pilots | |
| produce substantively correct answers on K8s content under the | |
| working threshold, which is evidence the retrieval stack generalizes to K8s. | |
| The pilot's job was schema validation + calibration evidence, not | |
| launch metrics. Launch metrics come from the 25-question K8s golden | |
| set with tuned threshold and (likely) a revised system prompt, | |
| sequenced after this pilot. | |
| ## Evaluation-layer multi-corpus support lagged the serving-layer refactor | |
| The Tasks 1–8 multi-corpus refactor wired corpora through | |
| `app.state.corpus_map` and the `/ask` serving route. `scripts/evaluate.py` | |
| was not touched and remained single-corpus: it read | |
| `config.rag.store_path` and `config.evaluation.golden_dataset` | |
| directly, with no awareness of the `corpora` dict. This was an | |
| accurate scoping of the refactor (serving-layer, not eval-layer) but | |
| the gap was not surfaced in the original task list. | |
| The K8s pilot commit adds `--corpus <name>` to `scripts/evaluate.py`, | |
| routing through `config.corpora[name]` for `store_path`, | |
| `refusal_threshold`, and a new optional `golden_dataset` field on | |
| `CorpusConfig`. Without `--corpus`, the legacy single-store path is | |
| preserved for backward compatibility with `make evaluate-fast` and | |
| any existing invocations. | |
| `CorpusConfig.golden_dataset` is `str | None = None` (optional | |
| rather than required) because two legitimate states exist: corpus | |
| has a golden dataset (FastAPI, K8s post-authoring), and corpus has no | |
| golden dataset yet (any corpus during bring-up). The CLI errors | |
| cleanly with *"corpus '<name>' has no golden_dataset configured"* | |
| when the field is None, rather than requiring all corpora to ship | |
| with datasets. | |
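The optional-field behavior can be sketched as follows. This is an illustration under assumptions: the real `CorpusConfig` may be a different kind of model, the store paths are placeholders, and `resolve_golden_dataset` is a hypothetical helper standing in for the CLI's `--corpus` handling. Only the field names, thresholds, FastAPI dataset path, and error message come from this document.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CorpusConfig:
    store_path: str
    refusal_threshold: float
    golden_dataset: str | None = None  # None while a corpus is in bring-up


def resolve_golden_dataset(corpora: dict[str, CorpusConfig], name: str) -> str:
    """Return the golden dataset path for a corpus, or exit with a clear error."""
    if name not in corpora:
        raise SystemExit(f"unknown corpus '{name}'")
    path = corpora[name].golden_dataset
    if path is None:
        raise SystemExit(f"corpus '{name}' has no golden_dataset configured")
    return path


corpora = {
    "fastapi": CorpusConfig(
        store_path="stores/fastapi",  # placeholder path
        refusal_threshold=0.02,
        golden_dataset="agent_bench/evaluation/datasets/tech_docs_golden.json",
    ),
    # A corpus without a dataset yet is a valid configuration.
    "k8s": CorpusConfig(store_path="stores/k8s", refusal_threshold=0.015),
}
```

Raising `SystemExit` with a message gives the clean CLI failure described above without forcing every corpus to ship a dataset.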
| ## Deferred: path-preserving ingestion | |
| `scripts/ingest.py` uses `doc_path.glob("*.md")` (non-recursive) and | |
| stores the bare filename as the chunk's `source` field. This forces | |
| a flat-namespace convention: FastAPI ships as `fastapi_*.md`, K8s | |
| ships as `k8s_*.md`, and golden dataset `expected_sources` are | |
| filename stems. The path-preserving alternative (recursive `rglob` | |
| plus relative-path source IDs, e.g., `concepts/workloads/pods`) was | |
| evaluated during the K8s pilot planning and explicitly deferred. The | |
| root-cause refactor would have required FastAPI re-ingestion and a | |
| rewrite of the FastAPI golden dataset's `expected_sources`, trading | |
| certain regression risk on a green baseline (288 tests, citation | |
| accuracy 1.00 on API providers) for speculative legibility benefit | |
| on K8s authoring. | |
| The `source_pages` field on `GoldenQuestion` preserves the | |
| human-readable path anchor separately from the machine identifier, | |
| so the deferral does not lose information. Authors see both | |
| `expected_sources: ["k8s_pods.md"]` (what the evaluator matches on) | |
| and `source_pages: ["concepts/workloads/pods"]` (where the content | |
| came from on kubernetes.io) in the same question record. | |
| **Pattern marker, not a promise.** This is the second visa-timeline | |
| deferral of a root-cause refactor in favor of a minimal-blast-radius | |
| fix; the first was the Mar 25 → Apr 12 P@5 slide bisection. Both | |
| deferrals were deliberate, not oversights. Not scheduled until | |
| post-launch; marker only. Post-launch scope: modify `ingest.py` to | |
| `rglob` + relative-path source IDs, re-ingest FastAPI, rewrite both | |
| golden datasets' `expected_sources` to path-style. Estimated 3h. | |
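The difference between the current flat convention and the deferred path-preserving one comes down to `glob` versus `rglob` and how the source ID is derived. A small sketch (both function names are hypothetical; `ingest.py` itself is not reproduced here):

```python
import tempfile
from pathlib import Path


def flat_sources(doc_path: Path) -> list[str]:
    """Current convention as described: non-recursive, bare filename as ID."""
    return sorted(p.name for p in doc_path.glob("*.md"))


def path_sources(doc_path: Path) -> list[str]:
    """Deferred alternative: recursive, relative path without suffix as ID."""
    return sorted(
        p.relative_to(doc_path).with_suffix("").as_posix()
        for p in doc_path.rglob("*.md")
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "concepts" / "workloads").mkdir(parents=True)
    (root / "k8s_pods.md").write_text("# Pods\n")
    (root / "concepts" / "workloads" / "pods.md").write_text("# Pods\n")
    flat = flat_sources(root)
    nested = path_sources(root)

print(flat)    # ['k8s_pods.md']
print(nested)  # ['concepts/workloads/pods', 'k8s_pods']
```

The nested file is invisible to the flat scan, which is why the flat convention forces prefixed filenames, and why switching conventions changes every existing source ID at once.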
| ## K8s refusal_threshold empirical calibration: 0.02 → 0.015 | |
| **Change.** `configs/default.yaml`, `corpora.k8s.refusal_threshold`: | |
| `0.02` → `0.015`. Single-line config change, pilot-corpus only. | |
| FastAPI threshold unchanged. | |
| **Empirical evidence.** Diagnostic instrumentation of `k8s_pilot_005` | |
| (*"How do I configure a Kubernetes NetworkPolicy to enforce mutual | |
| TLS (mTLS) between Pods in the same namespace?"*) captured the | |
| retrieval gate firing at `max_score = 0.01639344262295082`, exactly | |
| `1 / (60 + 1)`, the algebraic floor for a single rank-1 BM25 hit | |
| under RRF with `rrf_k = 60`, dense contribution zero. At | |
| `refusal_threshold = 0.02`, pilot_005 tripped the gate and | |
| short-circuited before retrieval chunks reached the agent. At | |
| `refusal_threshold = 0.015` (one tick below the measured floor), the | |
| gate releases and retrieval proceeds. The 0.015 value is not a | |
| tuning guess; it is the nearest round-number floor below the | |
| observed gate-fire value for the single worst pilot in the set. | |
| **Validation.** `results/k8s_preedit.json` captures the full 6-pilot | |
| run at 0.015. Aggregate: P@5 0.80, R@5 1.00, KHR 0.78, mean | |
| `tool_calls_made` 1.167. All six questions receive retrieval; no | |
| gate-fire short-circuits. pilot_005 still refuses as a separate | |
| downstream issue (see next entry when the counterfactual-query fix | |
| lands); that is not a threshold problem. | |
| **Scope of this commit.** K8s only. FastAPI `refusal_threshold` | |
| (0.02) is not affected and FastAPI baseline is not re-measured. | |
| Launch-intent `0.30` placeholder for K8s remains as a comment | |
| marker; the full threshold sweep against the 25-question golden set | |
| replaces 0.015 with a properly-tuned value in a later commit. 0.015 | |
| is the pilot-floor safety value, not the production-target value. | |
| **Why this is a separate commit from the prompt revision.** The | |
| threshold calibration is empirically grounded on its own: it | |
| removes the 0.01639 gate-fire blocker, which is the precondition for | |
| any downstream evaluation of pilot_005's actual agent behavior. The | |
| prompt revision addresses a *different* failure mode surfaced once | |
| the gate releases (agent search strategy is monotone | |
| positive-framing). Two independent changes must not entangle in one commit; | |
| if the prompt revision fails its regression gate and is reverted, | |
| the threshold calibration should stand on its own empirical merit. | |
| Feedback memory `feedback_fix_before_sweep.md` applies recursively: | |
| fix measurement-affecting bugs at every layer before combining | |
| fixes into single experiments. | |
| ## Prep for counterfactual-query prompt regression: pin, wire, tolerances | |
| **Three sub-changes bundled as one prep commit, each small and in | |
| service of making the downstream regression measurement valid.** | |
| **1. OpenAI model pin.** `agent_bench/core/provider.py:208` changes | |
| `self.model = "gpt-4o-mini"` → `self.model = "gpt-4o-mini-2024-07-18"`. | |
| The unpinned alias is a known drift vector: the Mar 25 → Apr 12 P@5 | |
| slide bisection is an already-open parallel track item traceable to | |
| silent alias migration. A regression run that uses the alias across | |
| pre-edit and post-edit phases conflates prompt-clause effect with | |
| model drift, even within a single session if the alias happens to | |
| roll between runs. Pinning the dated snapshot removes the variable. | |
| Pricing dict in `configs/default.yaml` gets a matching | |
| `gpt-4o-mini-2024-07-18` entry so the cost-lookup at | |
| `provider.py:209` still resolves. Tests that pin the model string | |
| live in mock response payloads (not outgoing assertions) and the | |
| langchain baseline (separate code path); neither is affected. | |
| **2. FastAPI multi-corpus eval wiring.** `configs/default.yaml` | |
| adds `corpora.fastapi.golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json`. | |
| The production serving path at `routes.py:105-120 _resolve_system_prompt` | |
| already routes `/ask` and `/ask/stream` through `format_system_prompt(label)` | |
| from `core/prompts.py`; the `app.state.system_prompt` legacy fallback | |
| (`serving/app.py:276`) is effectively dead code given the shipped multi-corpus | |
| config. The **only** remaining caller of `task.system_prompt` is the | |
| `scripts/evaluate.py` legacy branch used by `make evaluate-fast`. Adding | |
| the missing `golden_dataset` field makes `--corpus fastapi` work so the | |
| regression gate can measure the actual production prompt path, not the | |
| legacy eval-scaffolding prompt. Purely additive; zero blast radius on | |
| serving (serving doesn't read `golden_dataset`). | |
| **3. Pre-committed four-metric tolerances.** Written down now, before | |
| the post-edit runs, so the pass/fail call on the counterfactual-query | |
| prompt clause is not a judgment under confirmation-bias pressure. | |
| Applied identically to FastAPI and K8s: | |
| | Metric | Pass criterion | | |
| |---|---| | |
| | P@5 | post-edit ≥ pre-edit − 0.02 | | |
| | R@5 | post-edit ≥ pre-edit − 0.02 | | |
| | Citation accuracy | post-edit ≥ pre-edit (**hard gate**; any drop blocks commit) | | |
| | Mean `tool_calls_made` | post-edit ≤ pre-edit + 0.30 | | |
| | Individual question cap | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit | | |
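The table above translates into a mechanical check. The sketch below is one way to encode it, assuming per-run aggregates are available as a dict and per-question tool-call counts as a mapping from question ID to count; the key names and both function names are hypothetical.

```python
def check_tolerances(pre: dict, post: dict) -> dict[str, bool]:
    """Apply the pre-committed aggregate tolerances; True means pass."""
    return {
        "p_at_5": post["p_at_5"] >= pre["p_at_5"] - 0.02,
        "r_at_5": post["r_at_5"] >= pre["r_at_5"] - 0.02,
        "citation": post["citation"] >= pre["citation"],  # hard gate
        "tool_calls": post["tool_calls"] <= pre["tool_calls"] + 0.30,
    }


def new_cap_hits(
    pre_calls: dict[str, int], post_calls: dict[str, int], cap: int = 3
) -> list[str]:
    """Questions below the iteration cap pre-edit that reach it post-edit."""
    return sorted(
        qid
        for qid, n in post_calls.items()
        if n >= cap and pre_calls.get(qid, 0) < cap
    )
```

A run passes only if every value returned by `check_tolerances` is true and `new_cap_hits` is empty. Encoding the criteria before the post-edit run keeps the pass/fail call out of the experimenter's hands.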
| **pilot_005 strict flip criterion (K8s-only):** | |
| - `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]` | |
| - Answer cites `k8s_network_policies.md` | |
| - Answer contains "service mesh" OR "ingress controller" (the concrete documented-negative evidence the pre-edit refusal lacked) | |
| - Answer does NOT begin with refusal phrasing ("The ... documentation does not provide", "I cannot answer") | |
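The four flip conditions above can be combined into a single predicate. This is a sketch under assumptions: keyword hits are counted as case-insensitive substring matches (the evaluator's real matching rule is not specified here), and the refusal-opener regex is one possible reading of the elided phrasing.

```python
import re

GOLDEN_KEYWORDS = [
    "not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller",
]

# Assumed reading of the two refusal openers listed above.
REFUSAL_OPENER = re.compile(
    r"\s*(the\s.{0,60}?documentation does not provide|i cannot answer)",
    re.IGNORECASE,
)


def keyword_hit_rate(answer: str, keywords: list[str]) -> float:
    """Fraction of keywords found in the answer (case-insensitive substring)."""
    text = answer.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)


def pilot_005_flipped(answer: str, cited_sources: list[str]) -> bool:
    """True only when all four strict flip conditions hold."""
    text = answer.lower()
    return (
        keyword_hit_rate(answer, GOLDEN_KEYWORDS) >= 0.60
        and "k8s_network_policies.md" in cited_sources
        and ("service mesh" in text or "ingress controller" in text)
        and REFUSAL_OPENER.match(answer) is None
    )
```

Under this matching rule, an answer that hits every keyword except "service mesh" and "ingress controller" scores 4/6, clearing the 0.60 bar on keywords alone, so the explicit service-mesh and opener conditions are what actually separate a refusal from a flip.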
| **Baseline reference:** K8s pre-edit numbers from `results/k8s_preedit.json` | |
| at commit `125dac0`: P@5 0.80, R@5 1.00, citation 1.00 (all 6), | |
| mean tool_calls 1.167. FastAPI pre-edit reference established by | |
| `results/fastapi_preedit.json` in the next step of this session, | |
| same pinned ID, same refusal threshold (0.02). | |
| **Rationale for bundling.** All three sub-changes answer "what must | |
| be true before the regression measurement is valid": drift control, | |
| evaluation path, decision criteria. Splitting into three commits | |
| would add noise without adding signal. None of them change the | |
| prompt template itself; the prompt edit is the NEXT commit and is | |
| the sole experimental variable the regression measures. | |
| ## Fix 1 (prompt-level counterfactual clause) attempted and reverted | |
| **Outcome.** K8s regression clean on every metric (P@5, R@5, KHR, | |
| citation, mean tool_calls all within tolerance or unchanged); K8s | |
| pilot_005 flipped from refusal to documented-negative-with-citation | |
| as designed (KHR 0.67 → 1.00, answer contains both "service mesh" | |
| and "ingress controller", cites `k8s_network_policies.md`). | |
| **FastAPI regression failed** on the iteration-inflation tolerance: | |
| mean `tool_calls_made` 1.111 → 1.556 (delta +0.444, gate +0.30), | |
| and two retrieval questions (q024, q025) were pushed from 1 pre-edit | |
| tool call to 3 post-edit tool calls (hitting `max_iterations=3` | |
| cap), violating the pre-committed "no new cap-hits from sub-cap | |
| baseline" criterion. | |
| **Correctness metrics on FastAPI all held.** Citation accuracy | |
| stayed at 1.000 / 1.000 across all 27 questions. P@5 delta −0.007, | |
| R@5 delta 0.000, KHR delta +0.006. The failure is purely process | |
| inflation, not output regression. q024 and q025 produce identical | |
| P@5/R@5/KHR/citation numbers pre and post despite the cap-hit; the | |
| orchestrator's "max iterations hit → one final complete() without | |
| tools" path happened to keep answers correct, but that is an | |
| observation, not structural protection. | |
| **Failure mode.** The clause's trigger condition (*"your first | |
| search returned documentation about the subject of the question | |
| without addressing the specific capability or feature the user is | |
| asking about"*) relies on subjective LLM judgment about whether | |
| retrieved content "addresses" a capability. The judgment is fuzzy | |
| on compound multi-topic questions where the first search returns | |
| partial-topic coverage. q024 asks about "Docker + Gunicorn workers | |
| + health checks + Pydantic Settings"; first search returns Docker | |
| content, LLM reads "documentation about the subject without | |
| addressing the specific capability," fires the follow-up with | |
| negative framing, gets nothing useful, does a third normal search | |
| to cover the remaining topics, hits the cap. Same pattern on q025. | |
| Over-firing on this class of question is an inherent fragility of | |
| prompt-level LLM-judged triggers; a wording refinement might | |
| narrow the misfire rate but cannot eliminate it as long as the | |
| judgment itself is fuzzy. | |
| **q023 vs q024/q025 asymmetry is a useful signal for Fix 2.** q023 | |
| is a pre-existing 3-tool-call compound question ("custom error | |
| handling + CORS middleware + structured testing with dependency | |
| overrides"). Under the prompt clause, **q023 was unchanged** (the | |
| clause did not fire on it), while q024 and q025, structurally | |
| similar compound questions, were pushed into 3-tool-call cap-hit. | |
| The difference is not in question structure but in how the LLM | |
| interpreted the first-search return for each. That asymmetry is | |
| the precise reason a deterministic trigger is the right next step: | |
| any Fix 2 / Fix 3 candidate should be unit-testable against | |
| `(pilot_005, q023, q024, q025)`: the right fix must fire on | |
| pilot_005 and behave predictably on all three compound questions | |
| (either fire on all of them or none of them, but not pick them | |
| selectively by LLM whim). | |
| **Gate discipline honored.** The pre-committed FastAPI tolerances | |
| fired for exactly the reason the pre-commitment was designed: | |
| catching process-metric regressions before they ship. Tolerance- | |
| relaxation post-hoc would burn the session's strongest discipline | |
| artifact (pre-committed-tolerances + honored-gate) for marginal | |
| ship-this-approach EV. The narrow pilot_005 finding does not | |
| evaporate with the revert: chunk 63 (`d0806d5da91d6026`) is real, | |
| the negative-framing retrieval is reproducible, and Fix 2 will | |
| surface the documented negative the same way via a deterministic | |
| path. | |
| **Fix 2 deferred to a later session.** Deterministic query | |
| expansion at the `SearchTool` layer: when a `search_documents` | |
| call returns no chunk containing a direct answer string, issue a | |
| second internal search with negative-framing keywords and merge | |
| results before returning to the orchestrator. Offline-testable, | |
| corpus-agnostic, no LLM judgment required, no iteration-budget | |
| impact (the double-search happens inside a single tool call, not | |
| across iterations). Unit-testable against the | |
| `(pilot_005, q023, q024, q025)` asymmetry as an acceptance fixture. | |
| **Evidence retained.** Four result JSONs in `results/` document the | |
| regression measurement at the pinned `gpt-4o-mini-2024-07-18` | |
| snapshot in this session: | |
| - `fastapi_preedit.json`: 27 questions, HEAD prompt, 0.02 threshold | |
| - `fastapi_postedit.json`: 27 questions, clause prompt, 0.02 threshold (**gate-failing run**) | |
| - `k8s_preedit_pinned.json`: 6 pilots, HEAD prompt, 0.015 threshold | |
| - `k8s_postedit.json`: 6 pilots, clause prompt, 0.015 threshold (**gate-passing run, pilot_005 strict flip confirmed**) | |
| The previously-committed `results/k8s_preedit.json` (from `125dac0`) | |
| is also a valid K8s-pinned measurement at the session-equivalent | |
| snapshot and remains the canonical threshold-commit evidence. | |
| **Held DECISIONS.md drafts stay held.** The counterfactual-query | |
| finding draft (to be updated when Fix 2 lands) and the threshold- | |
| calibration entry already committed at `125dac0` are both correct | |
| in scope. The narrowed serving-migration deferral entry (tied to | |
| any external reference to the counterfactual-query fix) also stays | |
| deferred until Fix 2 lands, since the production/eval-harness | |
| prompt divergence is unchanged by this revert. | |
| ## Fix 2 pre-committed regression gate: SearchTool deterministic query expansion | |
| **Pre-committed BEFORE post-edit runs** (same discipline pattern | |
| that caught Fix 1's iteration inflation cleanly). | |
| **Mechanism under test.** `agent_bench/tools/search.py` | |
| `SearchTool.execute` gains a deterministic two-query retrieval | |
| path. When the primary retrieval passes the refusal gate, a | |
| secondary retrieval is issued against an expanded query | |
| (`original_query + " not supported limitations cannot"`), and the | |
| final context returned to the LLM is `primary_top_3 ++ | |
| secondary_top_5` deduplicated by `chunk.id`. Both retrievals run | |
| inside a single `SearchTool.execute` call; from the LLM's | |
| perspective, the tool schema, name, parameters, and return shape | |
| are unchanged, and the iteration budget is untouched. | |
| **Why this is architecturally different from Fix 1.** Fix 1 placed | |
| a behavioral clause in the system prompt that told the agent to | |
| issue follow-up searches itself. The trigger was an LLM judgment | |
| ("did the first search return content addressing the specific | |
| capability?") and the follow-up was a separate tool call, so it | |
| counted against `max_iterations`. Over-firing on compound questions | |
| inflated iteration counts and pushed q024/q025 to the cap. Fix 2 | |
| replaces this with a deterministic trigger (primary passes gate), | |
| a fixed expansion suffix, and a merge that happens entirely inside | |
| one tool call. No LLM judgment; no iteration change; corpus- | |
| agnostic. | |
| **Suffix choice.** `" not supported limitations cannot"`. | |
| Keyword-dense, ungrammatical on purpose: the suffix exists to shift BM25 | |
| and embedding mass toward "what you cannot do" / "limitations" | |
| sections, not to read well. The ungrammatical form is also a self- | |
| documenting signal in retrieval logs: anyone reading a query trace | |
| sees the suffix and immediately knows it is a synthetic expansion, | |
| not user input. A one-line comment in `search.py` preserves the | |
| rationale for future readers. | |
| **Merge choice.** `primary_top_3 + secondary_top_5` deduped by | |
| `chunk.id`, producing 5–8 unique chunks per call. Rationale: top-5 | |
| primary would make the expansion redundant on high-overlap queries | |
| (defeating the mechanism), while primary-top-3 guarantees the | |
| expansion always contributes to the final context window. Probe | |
| data (`/tmp/probe_fix2_v2.py`, throwaway) confirms this merge | |
| strategy surfaces pilot_005's target chunk | |
| (`d0806d5da91d6026`, chunk_index 63, "Anything TLS related ... use | |
| a service mesh or ingress controller for this") at position 6–8 in | |
| the merged list. | |
| **Opt-in flag, defaulting ON.** `SearchTool` accepts | |
| `negative_framing_expansion: bool = True`. Default is the shipping | |
| configuration because the regression gate must measure the shipping | |
| behavior, not the no-op path. A `False` default would mean the gate | |
| validates an unused parameter, and a subsequent commit flipping the | |
| default would have no regression evidence. Kill switch is preserved | |
| via explicit `False` at construction if a future regression | |
| requires an A/B comparison. | |
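The mechanism, suffix, merge rule, and flag described above can be pulled together in a minimal sketch. It is not the code in `agent_bench/tools/search.py`: the `Chunk` fields, the `Retriever` callable, the gate shape (refuse when the best score is under the threshold), and the top-5 return on the no-expansion path are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Fixed, deliberately ungrammatical suffix quoted in this entry.
NEGATIVE_FRAMING_SUFFIX = " not supported limitations cannot"


@dataclass(frozen=True)
class Chunk:
    id: str
    source: str
    score: float


# Hypothetical retriever signature: query in, ranked chunks out.
Retriever = Callable[[str], List[Chunk]]


class SearchToolSketch:
    """Illustrative stand-in for SearchTool; not the real class."""

    def __init__(self, retrieve: Retriever, refusal_threshold: float,
                 negative_framing_expansion: bool = True) -> None:
        self.retrieve = retrieve
        self.refusal_threshold = refusal_threshold
        self.negative_framing_expansion = negative_framing_expansion

    def execute(self, query: str) -> List[Chunk]:
        primary = self.retrieve(query)
        # Assumed gate shape: refuse when the best score is under the threshold.
        if not primary or max(c.score for c in primary) < self.refusal_threshold:
            return []
        if not self.negative_framing_expansion:
            return primary[:5]  # assumed no-expansion behavior
        # The second retrieval runs inside the same tool call.
        secondary = self.retrieve(query + NEGATIVE_FRAMING_SUFFIX)
        merged, seen = [], set()
        for chunk in primary[:3] + secondary[:5]:
            if chunk.id not in seen:
                seen.add(chunk.id)
                merged.append(chunk)
        return merged


def fake_retrieve(query: str) -> List[Chunk]:
    """Toy retriever: the expanded query overlaps the primary list on c2."""
    if query.endswith(NEGATIVE_FRAMING_SUFFIX):
        ids = ["c2", "x1", "x2", "x3", "x4"]
    else:
        ids = ["c1", "c2", "c3", "c4", "c5"]
    return [Chunk(id=i, source="a.md", score=0.03) for i in ids]


tool = SearchToolSketch(fake_retrieve, refusal_threshold=0.015)
print([c.id for c in tool.execute("q")])
# ['c1', 'c2', 'c3', 'x1', 'x2', 'x3', 'x4']
```

The caller sees one `execute` call and one list back, which is the property the iteration-budget argument rests on.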
| **Baseline reuse.** The Fix 1 session's pre-edit JSONs | |
| (`results/fastapi_preedit.json`, `results/k8s_preedit_pinned.json`, | |
| both committed at `213da36`) were measured under the currently- | |
| committed state of the repo: pinned `gpt-4o-mini-2024-07-18`, K8s | |
| threshold 0.015, FastAPI threshold 0.02, HEAD `prompts.py` with no | |
| clause, HEAD `search.py` with no expansion. The working tree | |
| verification confirms this state is unchanged. These JSONs are | |
| therefore reused as the Fix 2 pre-edit baseline and do not need to | |
| be re-measured. Only post-edit runs are required for the Fix 2 | |
| regression (~$0.02 saved). | |
| **Pre-committed tolerances.** | |
| | Metric | Pass criterion | | |
| |---|---| | |
| | P@5 | post-edit ≥ pre-edit − 0.02 | | |
| | R@5 | post-edit ≥ pre-edit − 0.02 | | |
| | Citation accuracy | post-edit ≥ pre-edit (**hard gate**: any drop blocks commit) | | |
| | Mean `tool_calls_made` | post-edit ≤ pre-edit + **0.05** (design-correctness gate; see note) | | |
| | Individual cap-hit | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit | | |
| **Note on the tool_calls gate.** ≤ +0.05 is a *design-correctness* | |
| gate, not a *performance* gate. Fix 2's invariant is that both | |
| retrievals happen inside one `SearchTool.execute` call, so the | |
| LLM's iteration count is unchanged by construction. Any non-trivial | |
| movement in `mean tool_calls_made` indicates the design invariant | |
| is broken, e.g., expansion accidentally exposed as a separate | |
| tool, or the LLM observing two-call behavior and adapting its | |
| strategy. The gate fires on design violation, not on performance | |
| regression. The 0.05 absolute threshold absorbs legitimate run-to- | |
| run variance from non-determinism in the LLM even at temperature | |
| 0, without absorbing real iteration-count movement. | |
| **pilot_005 strict flip criterion (K8s-only, unchanged from Fix 1 | |
| gate):** | |
| - `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]` | |
| - Answer cites `k8s_network_policies.md` | |
| - Answer contains "service mesh" OR "ingress controller" | |
| - Answer does NOT begin with refusal phrasing | |
| **Baseline reference for the gate.** | |
| | Corpus | Pre-edit source | P@5 | R@5 | Citation | Mean tool_calls | | |
| |---|---|---|---|---|---| | |
| | FastAPI (27) | `results/fastapi_preedit.json` @ `213da36` | 0.585 | 0.679 | 1.000 | 1.111 | | |
| | K8s (6 pilots) | `results/k8s_preedit_pinned.json` @ `213da36` | 0.800 | 1.000 | 1.000 | 1.167 | | |
| **Post-edit filenames (to be produced).** | |
| - `results/fastapi_postedit_fix2.json` | |
| - `results/k8s_postedit_fix2.json` | |
| **If the gate passes:** commit Fix 2 with `search.py` change, unit | |
| tests (including the tool-spec snapshot test), the two post-edit | |
| result JSONs, and this DECISIONS.md entry extended with the | |
| regression outcome. | |
| **If the gate fires:** revert, document the failure mode, surface | |
| the specific criterion that fired. No tolerance relaxation: same | |
| discipline pattern as Fix 1 revert. | |
| ## Fix 2 outcome: mechanism works, response-style criterion fired, reverted | |
| **Regression runs produced.** Two post-edit runs on K8s (FastAPI not | |
| run; K8s findings gated the decision before API spend on the | |
| broader set): | |
| | Run | Merge rule | File | Purpose | | |
| |---|---|---|---| | |
| | Fix 2 v1 | `primary[:3] + secondary[:5]` | `results/k8s_postedit_fix2.json` | Initial implementation | | |
| | Fix 2 v2 | `primary[:5] + secondary[:5]` | `results/k8s_postedit_fix2_merge_v2.json` | Path A refinement after v1 failed P@5 on a metric-definition mismatch | | |
| **v1 findings.** Aggregate: P@5 0.800 → 0.767 (Δ −0.033, **FAILED** | |
| the P@5 ≥ −0.02 tolerance). The failure traced to a merge-rule / | |
| metric-semantics interaction: `retrieval_precision_at_k` computes | |
| precision on `retrieved_sources[:5]`, and with `primary[:3] + | |
| secondary[:5]` the first 5 entries were `primary_top_3 + | |
| secondary_top_2`. For pilot_005, `secondary[1]` was | |
| `k8s_pods.md` (chunk_index 40, surfaced because the reranker | |
| matched its "localhost communication" content against the expanded | |
| query). That single off-source chunk in position 5 dropped P@5 | |
| from 1.00 to 0.80 for pilot_005 and similarly for pilot_006. | |
| Iteration invariant held (tool_calls 1.167 → 1.167). Citation | |
| accuracy held (1.000 → 1.000). Target chunk | |
| (`d0806d5da91d6026`, "Anything TLS related") reached the LLM | |
| context for pilot_005 at merged position 7. | |
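The aggregate part of the pre-committed gate, applied to the v1 numbers quoted above, can be sketched as follows. The dictionary keys and function name are hypothetical, the per-question cap-hit criterion is left out, and R@5 is omitted from the v1 input because this entry does not report it for that run.

```python
from typing import Dict, List

# Pre-committed tolerances from the gate table (aggregate criteria only).
LOWER_BOUND_TOL = {"p_at_5": 0.02, "r_at_5": 0.02, "citation_accuracy": 0.0}
TOOL_CALLS_TOL = 0.05


def failed_criteria(pre: Dict[str, float], post: Dict[str, float]) -> List[str]:
    """Return the aggregate criteria that fire; an empty list means pass."""
    failures = []
    for metric, tol in LOWER_BOUND_TOL.items():
        # Only metrics reported for the post-edit run are checked.
        if metric in post and post[metric] < pre[metric] - tol:
            failures.append(metric)
    if post["mean_tool_calls"] > pre["mean_tool_calls"] + TOOL_CALLS_TOL:
        failures.append("mean_tool_calls")
    return failures


# K8s pre-edit baseline from the gate's reference table.
PRE_K8S = {"p_at_5": 0.800, "r_at_5": 1.000,
           "citation_accuracy": 1.000, "mean_tool_calls": 1.167}
# Fix 2 v1 aggregate as reported in this entry.
V1_POST = {"p_at_5": 0.767, "citation_accuracy": 1.000, "mean_tool_calls": 1.167}

print(failed_criteria(PRE_K8S, V1_POST))  # ['p_at_5']
```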
| **Path A refinement (merge v2).** Change `primary[:3] + | |
| secondary[:5]` → `primary[:5] + secondary[:5]`. Rationale: | |
| primary_top_5 is preserved in positions 1–5 by construction, so | |
| P@5 computed on `ranked_sources[:5]` is unchanged from the | |
| no-expansion baseline. Expansion chunks land in positions 6–10. | |
| Target chunk still reaches LLM context (position 9 for pilot_005). | |
| This is an **implementation refinement, not a tolerance | |
| relaxation**: the pre-committed gate thresholds stand; only the | |
| merge rule was adjusted to respect the metric's window semantics. | |
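The interaction between the merge rule and the metric's five-entry window can be reproduced with toy data. The chunk ids below are made up; only the shape matches the description above: five on-source primary chunks, and one off-source chunk (`k8s_pods.md`) at `secondary[1]`.

```python
from typing import List, Set, Tuple

Chunk = Tuple[str, str]  # (chunk_id, source_file), a simplification


def merge(primary: List[Chunk], secondary: List[Chunk], primary_k: int) -> List[Chunk]:
    """primary[:primary_k] + secondary[:5], keeping the first copy of each id."""
    merged: List[Chunk] = []
    seen = set()
    for chunk in primary[:primary_k] + secondary[:5]:
        if chunk[0] not in seen:
            seen.add(chunk[0])
            merged.append(chunk)
    return merged


def precision_at_5(ranked: List[Chunk], expected_sources: Set[str]) -> float:
    """Precision over the first five entries only, as the metric window does."""
    return sum(1 for _, source in ranked[:5] if source in expected_sources) / 5


NP = "k8s_network_policies.md"
primary = [(f"p{i}", NP) for i in range(5)]
secondary = [("s0", NP), ("s1", "k8s_pods.md"), ("s2", NP), ("s3", NP), ("s4", NP)]

v1 = merge(primary, secondary, primary_k=3)  # window holds p0, p1, p2, s0, s1
v2 = merge(primary, secondary, primary_k=5)  # window holds p0..p4
print(precision_at_5(v1, {NP}), precision_at_5(v2, {NP}))  # 0.8 1.0
```

Under the v2 rule the expansion chunks can only appear from position 6 onward, so they are invisible to a metric that reads the first five entries.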
| **v2 findings: perfect metric preservation, but strict-flip fails on response style.** | |
| Aggregate: | |
| | Metric | Pre-edit | Fix 2 v2 | Delta | | |
| |---|---|---|---| | |
| | P@5 | 0.800 | 0.800 | **0.000** | | |
| | R@5 | 1.000 | 1.000 | 0.000 | | |
| | KHR | 0.806 | 0.806 | 0.000 | | |
| | Citation accuracy | 1.000 | 1.000 | 0.000 | | |
| | Mean `tool_calls_made` | 1.167 | 1.167 | **0.000** | | |
| Every aggregate metric **literally unchanged**. Per-question | |
| deltas: zero on every metric, every question. The design | |
| invariant (iteration budget unchanged, tool schema unchanged, | |
| refusal gate behavior unchanged) holds perfectly. | |
| **But pilot_005 strict flip fails on the refusal-phrasing criterion.** | |
| Post-edit answer: | |
| > *"The Kubernetes documentation does not provide specific | |
| > instructions on configuring a NetworkPolicy to enforce mutual TLS | |
| > (mTLS) between Pods in the same namespace. For mTLS, it is | |
| > generally recommended to use a service mesh or other proxy | |
| > solutions, as NetworkPolicy alone does not handle TLS | |
| > configurations directly [source: k8s_network_policies.md]."* | |
| The answer substantively contains the documented negative with | |
| citation. But it opens with *"The Kubernetes documentation does | |
| not provide specific instructions..."*, the exact | |
| refusal-phrasing opener the strict-flip criterion was pre-committed to | |
| reject. The criterion exists because the brand is honest | |
| evaluation: an answer that opens apologizing that the | |
| documentation "does not provide specific instructions" reads, to | |
| a technical reviewer, like the system failed to find the answer | |
| and is papering over the gap, even though the facts and citation | |
| are present. The criterion fired as designed. | |
| **Compare to Fix 1 post-edit answer (from `213da36` evidence):** | |
| > *"Kubernetes NetworkPolicy does not support enforcing mutual TLS | |
| > (mTLS) directly. The documentation states that anything TLS | |
| > related should be handled using a service mesh or ingress | |
| > controller, rather than through NetworkPolicy [source: k8s_network_policies.md]."* | |
| Fix 1's answer asserts a fact about **NetworkPolicy** ("does not | |
| support"); Fix 2's answer asserts a fact about **the documentation** | |
| ("does not provide instructions"). The first forecloses the | |
| capability; the second leaves open whether the capability exists | |
| somewhere the system didn't see. That distinction is load-bearing | |
| for any grounded-refusal narrative, and it separates a system that | |
| handles documented negatives crisply from one that hedges around | |
| them. | |
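The four strict-flip criteria can be sketched as a checker and run against the two quoted answers. The keyword-hit definition (case-insensitive substring match) and the refusal-opener pattern are assumptions; the entry names the criteria but not how they are implemented.

```python
import re
from typing import Dict, List

GOLDEN_KEYWORDS = ["not", "does not", "NetworkPolicy", "service mesh",
                   "TLS", "ingress controller"]

# Assumed opener patterns; the gate only says "refusal phrasing".
REFUSAL_OPENER = re.compile(
    r"^\W*(the \w+ documentation does not (provide|contain)|i (cannot|could not) find)",
    re.IGNORECASE,
)


def keyword_hit_rate(answer: str, keywords: List[str]) -> float:
    """Fraction of keywords found as case-insensitive substrings (assumed)."""
    lowered = answer.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered) / len(keywords)


def strict_flip(answer: str) -> Dict[str, bool]:
    lowered = answer.lower()
    return {
        "keyword_hit_rate": keyword_hit_rate(answer, GOLDEN_KEYWORDS) >= 0.60,
        "cites_source": "[source: k8s_network_policies.md]" in answer,
        "names_alternative": "service mesh" in lowered or "ingress controller" in lowered,
        "no_refusal_opener": REFUSAL_OPENER.match(answer) is None,
    }


FIX1_ANSWER = (
    "Kubernetes NetworkPolicy does not support enforcing mutual TLS (mTLS) "
    "directly. The documentation states that anything TLS related should be "
    "handled using a service mesh or ingress controller, rather than through "
    "NetworkPolicy [source: k8s_network_policies.md]."
)
FIX2_ANSWER = (
    "The Kubernetes documentation does not provide specific instructions on "
    "configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in "
    "the same namespace. For mTLS, it is generally recommended to use a service "
    "mesh or other proxy solutions, as NetworkPolicy alone does not handle TLS "
    "configurations directly [source: k8s_network_policies.md]."
)

print(all(strict_flip(FIX1_ANSWER).values()))       # True
print(strict_flip(FIX2_ANSWER)["no_refusal_opener"])  # False
```

Under these assumptions the Fix 2 answer passes the keyword, citation, and alternative checks and fails only on the opener, which matches the verdict recorded above.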
| **Diagnosis.** Fix 2's mechanism successfully gets the target chunk | |
| into the LLM's context window; the retrieval side of the problem | |
| is solved. What Fix 2 **cannot provide** is explicit guidance on | |
| how to phrase the documented negative once the chunk is present. | |
| Fix 1's prompt clause was doing that guidance work; removing the | |
| clause and relying on the LLM's unaided response style produces a | |
| hedging answer because the LLM, seeing both NetworkPolicy-spec | |
| content and a TLS limitation bullet, defaults to contextual | |
| hedging rather than crisp assertion. | |
| **Fix 2 is therefore not an alternative to Fix 1's prompt clause; | |
| it is a prerequisite.** Fix 2 guarantees the chunk reaches | |
| context; a future "Fix 2 + targeted prompt clause" stack could | |
| resolve both the retrieval gap and the response-style gap without | |
| Fix 1's over-firing problem, because the clause would no longer | |
| need to direct the agent to do a follow-up search (Fix 2 handled | |
| that). The over-firing on compound questions that broke Fix 1 was | |
| caused by the agent deciding to do extra search iterations under | |
| LLM judgment; if the expansion already happened deterministically | |
| inside the first tool call, the clause has less work to do and | |
| may not trigger the second-LLM-call pattern at all. **Speculative | |
| and not for this session.** Future work item. | |
| **Gate verdict: failed on pilot_005 strict flip criterion.** | |
| Reverting, same Fix-1 pattern. | |
| **What this commit contains.** | |
| - `agent_bench/tools/search.py` **reverted** to HEAD (no Fix 2 | |
| code changes) | |
| - `tests/test_tools.py` retains the `MockChunk.id` hygiene fix | |
| (the real `Chunk` class has `id`; mock should match the real API | |
| for future test authors) | |
| - `tests/test_tools.py` adds `TestSearchToolSpecSnapshot`: a | |
| general-purpose guard that freezes `SearchTool`'s LLM-facing | |
| contract (name, description, parameters). The lesson from Fix 2 | |
| is that any future refactor exposing internal SearchTool state | |
| to the LLM would break iteration-budget invariants; the | |
| snapshot test catches that at test time, independent of whether | |
| Fix 2 lands. | |
| - Two regression evidence JSONs: `results/k8s_postedit_fix2.json` | |
| (v1, the P@5 failure) and `results/k8s_postedit_fix2_merge_v2.json` | |
| (v2, the strict-flip failure). Retained as the measurement | |
| trail behind the revert decision. | |
| - This DECISIONS.md entry (pre-committed gate + outcome + revert | |
| narrative). | |
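The snapshot guard named in the list above might look roughly like this. Only the tool name `search_documents` comes from this document; the description text and parameter schema are placeholders, and `current_tool_spec` stands in for however the real `SearchTool` exposes its LLM-facing spec.

```python
import unittest
from typing import Dict, List


def current_tool_spec() -> Dict[str, object]:
    """Hypothetical accessor for the LLM-facing contract."""
    return {
        "name": "search_documents",
        "description": "Search the documentation corpus.",  # placeholder text
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }


# Frozen copy checked into the test file; any change must be deliberate.
FROZEN_SPEC = {
    "name": "search_documents",
    "description": "Search the documentation corpus.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def spec_drift(spec: Dict[str, object], frozen: Dict[str, object]) -> List[str]:
    """Top-level keys whose values differ between the live and frozen spec."""
    return sorted(k for k in set(spec) | set(frozen) if spec.get(k) != frozen.get(k))


class TestSearchToolSpecSnapshot(unittest.TestCase):
    def test_llm_facing_contract_is_frozen(self) -> None:
        self.assertEqual(spec_drift(current_tool_spec(), FROZEN_SPEC), [])
```

A refactor that leaked an internal switch into `parameters` would show up as `['parameters']` in the drift list and fail the test before any evaluation run.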
| **What this commit does NOT contain.** No changes to | |
| `agent_bench/tools/search.py`, `agent_bench/core/prompts.py`, or | |
| `configs/default.yaml`. Both Fix 1 (prompt clause) and Fix 2 | |
| (SearchTool expansion) have been attempted and reverted this | |
| session. Three commits of progress nonetheless: `125dac0` | |
| (threshold calibration, empirical), `5c1f49f` (prep bundle: model | |
| pin + fastapi wire + Fix 1 pre-committed tolerances), `213da36` | |
| (Fix 1 revert narrative). The threshold calibration and model pin | |
| are real, shipped, measurement-grounded infrastructure changes. | |
| The two fix attempts are documented learning that shapes the | |
| future direction. | |
| ## `grounded_refusal` metric reads answer text, not retrieved sources (2026-04-14) | |
| **Context.** Week 1 step 5 authoring (25-question K8s golden set). Two | |
| flavor-A out-of-scope questions (`k8s_004` Jaeger sidecar, `k8s_024` | |
| Envoy xDS ADS) surfaced a pre-existing bug in the | |
| `grounded_refusal` metric during the functional check. | |
| **Bug 1: wrong signal.** The metric's docstring said it checks | |
| whether the answer correctly refuses AND cites no sources, but the | |
| implementation was checking `len(response_sources) == 0` where | |
| `response_sources` is the *retrieved*-sources list. Real agents | |
| retrieve candidates on any non-trivial OOS query (the grounded-refusal | |
| gate at tool level only catches the thinnest queries), inspect the | |
| candidates, find nothing relevant, and refuse *in the answer text* | |
| without citing anything. Checking retrieval emptiness flagged those | |
| correct refusals as failures. Fix: inspect the answer text for | |
| `[source: X.md]` citations via regex; drop the `response_sources` | |
| parameter from the signature entirely. | |
| This was a silent false negative on all 5 fastapi out-of-scope | |
| questions (`q008`–`q010`, `q026`–`q027`) which all correctly refuse | |
| but were being marked `grounded_refusal=False`. Aggregate | |
| `refusal_rate` in `report.py` shifts by the resulting 5-question | |
| delta; any historical comparison to pre-fix fastapi numbers needs | |
| to acknowledge this. | |
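The difference between the two signals can be shown side by side. This is a sketch under assumptions: the `"out_of_scope"` category label, the single-entry phrase list, and the exact citation regex are illustrative, and the example answer and retrieved file names are invented.

```python
import re
from typing import List

# Citations of the form [source: X.md] in the answer text.
CITATION_RE = re.compile(r"\[source:\s*[^\]]+\.md\]")
REFUSAL_PHRASES = ["does not contain information"]
OUT_OF_SCOPE = "out_of_scope"  # hypothetical category label


def legacy_grounded_refusal(response_sources: List[str]) -> bool:
    """Pre-fix signal: keys off the *retrieved* sources list being empty."""
    return len(response_sources) == 0


def grounded_refusal(answer: str, category: str) -> bool:
    """Post-fix shape: the answer refuses and cites nothing in its own text."""
    if category != OUT_OF_SCOPE:
        return False
    refuses = any(phrase in answer.lower() for phrase in REFUSAL_PHRASES)
    cites = CITATION_RE.search(answer) is not None
    return refuses and not cites


answer = "The documentation does not contain information about Jaeger sidecar injection."
retrieved = ["k8s_pods.md", "k8s_services.md"]  # candidates the agent inspected

print(legacy_grounded_refusal(retrieved))      # False: correct refusal flagged as a failure
print(grounded_refusal(answer, OUT_OF_SCOPE))  # True
```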
| **Bug 2: metric coverage gap surfaced during 25-question authoring.** | |
| `grounded_refusal_rate` recognized "does not contain information" | |
| phrasing (in `refusal_phrases` list) but missed "not in the | |
| {corpus_label} documentation" phrasing, the exact shape taught by | |
| the system prompt at `core/prompts.py:17-18`. The LLM produced the | |
| canonical form on some questions and the phrase-list form on others; | |
| the metric inflation/deflation was non-deterministic. Fix: narrow | |
| regex `\bnot in the\b[^.]{0,60}\bdocumentation\b` added alongside | |
| phrase-list matching. | |
| **Rejected alternative.** Substring `"not in the"` would produce | |
| false positives on valid-answer phrasing ("the rate limit is not in | |
| the same scope as the request timeout", "the flag is not in the 1.28 | |
| release; it landed in 1.29", "this value is not in the default | |
| range"), all of which are legitimate retrieval answers with | |
| conditional or scope-limiting language, not refusals. Honest | |
| evaluation cannot afford a metric that silently counts these as | |
| grounded refusals. | |
| **Tests.** Two unit tests pin both directions: | |
| `test_canonical_refusal_phrasing_recognized` covers the positive | |
| case ("The answer is not in the Kubernetes documentation"), and | |
| `test_not_in_the_is_not_substring_refusal` covers the negative case | |
| ("The rate limit is not in the same scope as the request timeout"). | |
| The negative test is the load-bearing one: without it, a future | |
| refactor could silently widen the matcher back to substring and pass | |
| all existing tests. The negative test pins design intent. | |
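The narrow regex and the rejected substring check can be compared directly on the sentences quoted above. The two test functions reuse the names given in this entry, but their bodies are a reconstruction, not the contents of the repository's test file; the regex is used exactly as written, without a case-insensitive flag.

```python
import re

NARROW_RE = re.compile(r"\bnot in the\b[^.]{0,60}\bdocumentation\b")


def narrow_match(text: str) -> bool:
    return NARROW_RE.search(text) is not None


def substring_match(text: str) -> bool:
    """The rejected alternative: plain substring test."""
    return "not in the" in text


VALID_ANSWERS = [
    "the rate limit is not in the same scope as the request timeout",
    "the flag is not in the 1.28 release; it landed in 1.29",
    "this value is not in the default range",
]


def test_canonical_refusal_phrasing_recognized() -> None:
    assert narrow_match("The answer is not in the Kubernetes documentation")


def test_not_in_the_is_not_substring_refusal() -> None:
    # The substring check would count every one of these as a refusal.
    assert all(substring_match(text) for text in VALID_ANSWERS)
    assert not any(narrow_match(text) for text in VALID_ANSWERS)


test_canonical_refusal_phrasing_recognized()
test_not_in_the_is_not_substring_refusal()
```

The `[^.]{0,60}` span keeps the match inside one sentence and within 60 characters, which is what separates "not in the ... documentation" from incidental uses of "not in the".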
| **Scope bound.** This is a metric correctness fix, not a threshold | |
| change. The 0.015 refusal-gate threshold (calibrated in `125dac0` | |
| against the 6-question pilot) is unchanged by this commit. Whether | |
| the corrected metric shifts the optimal threshold against the full | |
| 25-question set is a question for the threshold-sweep session, not | |
| this authoring session. | |
| ## Parallel tracks / deferred items (2026-04-14) | |
| Tracked list of work items that are deferred to parallel sessions. | |
| Each item has a reason for deferral and a rough scope boundary so | |
| the session that picks it up has the context to pre-commit tolerances | |
| and decision criteria before measuring. | |
| 1. **`routes.py:552` audit-logger semantics unification.** The | |
| serving layer's audit record field still uses the pre-fix | |
| `grounded_refusal = not bool(sources)` expression, which disagrees | |
| with the evaluation metric's answer-text-based definition. Not | |
| surfaced to the dashboard (audit log only), but external reviewers | |
| who reference audit records for runtime verification would see a | |
| different definition than the benchmark claims. Fix: call | |
| `grounded_refusal(answer, category)` from `metrics.py` directly. | |
| When this lands, the "grounded_refusal metric" DECISIONS.md entry | |
| above should get a one-line addendum noting the unification. | |
| 2. **Full 25Q threshold sweep → production-target `refusal_threshold` | |
| for K8s.** The 25Q set exists, the metric is correct. Sweep | |
| against the full set, compare to pilot-floor 0.015, pick the | |
| production-target value, update `configs/default.yaml` placeholder | |
| comment. Pre-commit before measuring: sweep range, decision | |
| criteria, tolerances. Do not entangle with flavor-B response-style | |
| work below; those are independent axes. | |
| 3. **Flavor-B response-style class (pilot_005 + k8s_022).** Two | |
| independent reproductions of "LLM refuses when documented negative | |
| is in retrieved context". Retrieval is healthy on both; the gap | |
| is prompting. Future session: Fix 2 (counterfactual-query | |
| expansion in `SearchTool`) + targeted prompt clause, stacked. | |
| Previously speculative in the Fix 2 revert entry, the stack now | |
| addresses a documented reproducible class. Two reproductions, not one-off. | |
| 4. **Serving-migration deferral.** Tied to external references to | |
| the counterfactual-query fix. Unchanged from prior sessions. | |
| 5. **`agent-bench` → `refusal-bench` rename: CLOSED 2026-04-14.** | |
| Decision: keep `agent-bench`, reframe via tagline. The original | |
| concern was name collision with AgentBench (Liu et al., ICLR | |
| 2024, ~1000 citations). Due-diligence at launch time: the name | |
| is `agent-bench` (hyphenated) vs. `AgentBench` (camelcase), | |
| which are distinct identifiers across GitHub, arXiv, and PyPI. | |
| The two projects target different audiences (LLM-as-agent | |
| capability vs. RAG+refusal benchmark) and any reviewer reaching | |
| the repo via LinkedIn or CV sees the scope in the README within | |
| seconds. Rename cost is substantial (~350 internal references | |
| across ~60 files, two external account renames, one HF Space | |
| URL break with no redirect) for a naming-precision benefit that | |
| isn't supported by the actual scope: the benchmark's seven axes | |
| include retrieval, grounding, multi-hop, citation accuracy, and | |
| refusal, not refusal alone. Tagline reframe captures the | |
| honest-evaluation positioning without the rename cost: | |
| > "A RAG benchmark built from primitives, with honest | |
| > evaluation of retrieval, refusal, and grounded citation." | |
| HF Space rename (`Nomearod/agentbench` → `Nomearod/agent-bench` | |
| for GitHub-name consistency) is a separate, smaller follow-up | |
| deferred approximately one week. Reason: several job | |
| applications submitted the preceding week reference the current | |
| HF URL (`nomearod-agentbench.hf.space`); renaming the Space now | |
| would break those inbound links with no HF-side redirect. The | |
| rename absorbs cleanly once the application wave lands and the | |
| reference window expires. Until then the README, dashboard, and | |
| DECISIONS.md continue to reference the current `agentbench` URL; | |
| launch-adjacent work (Post #1, screenshots, cold-start measure) | |
| uses the current URL and will be updated in a single small | |
| follow-up commit when the rename happens. | |
| 6. **OpenAI snapshot drift bisection.** Mar 25 → Apr 12 P@5 slide; | |
| the model pin at `5c1f49f` (`gpt-4o-mini-2024-07-18`) removed | |
| the ongoing drift risk, so any future measurement is apples-to- | |
| apples. The original bisection is still unresolved but cheap at | |
| this point: tractable whenever there is session capacity, low | |
| urgency because the pin protects forward runs. | |
| 7. **Fix 2 revert commit SHA missing from the Fix 2 outcome entry.** | |
| The "Fix 2 outcome: mechanism works, response-style criterion | |
| fired, reverted" DECISIONS.md entry describes the revert | |
| narratively but does not cite the revert commit's SHA | |
| (post-rewrite: `27c2e17`, `docs(eval): Fix 2 SearchTool query | |
| expansion - attempted and reverted`). Add retroactive SHA | |
| reference in the next docs pass. Not urgent; noted so the | |
| narrative-without-SHA pattern does not spread to other entries. | |
| **Lesson going forward:** prefer explicit SHAs over positional | |
| references like "this commit" / "commit above" in DECISIONS.md | |
| entries; positional references do not survive history rewrites | |
| as robustly as SHA references do. | |
| ## K8s refusal_threshold sweep against 25-question golden (2026-04-14) | |
| **Override notice.** This sweep ran in the same session as the | |
| 25-question authoring + grounded_refusal metric fix (`4454894`), | |
| after I explicitly flagged that the parallel-tracks guidance from | |
| earlier in the session recommended waiting for a fresh session with | |
| pre-commitment discipline. The user issued an explicit override: | |
| "proceed on best-judgment sweep range and criteria" β logged here | |
| for audit trail. The pre-commitment frame below was drafted BEFORE | |
| running any sweep value, not after. The decision criteria were | |
| locked before the first data point was observed, not retrofitted. | |
| **Sweep grid.** 4 threshold values: `0.010`, `0.015` (already | |
| measured in `.cache/eval_k8s_full25_postfix.json`, the post-metric- | |
| fix run from `4454894`), `0.020`, `0.025`. | |
| - `0.010`: one tick below current calibration; sanity-check floor. | |
| - `0.015`: current calibration (pilot-floor, one tick below | |
| pilot_005's 0.01639 max_score). | |
| - `0.020`: matches legacy FastAPI threshold and the original | |
| provisional K8s default before the `125dac0` calibration. | |
| - `0.025`: one tick above legacy; exploration of whether aggressive | |
| OOS short-circuiting is worth the correctness risk. | |
| **Decision criteria (pre-committed).** | |
| 1. **OOS refusal must hold.** Both `k8s_004` (Jaeger) and `k8s_024` | |
| (Envoy xDS) must retain `grounded_refusal=True` at the chosen | |
| threshold; whether the gate fires at the tool level or the | |
| LLM refuses after inspecting context doesn't matter, only that | |
| the metric reports True. | |
| 2. **Retrieval recall must not degrade.** Each retrieval-category | |
| question's R@5 at the chosen threshold must be ≥ its R@5 at | |
| `0.015` (the post-fix-25Q baseline) with a noise tolerance of at | |
| most ONE question dropping by at most 0.20. Two or more drops, | |
| or any drop > 0.20, disqualifies the value. | |
| 3. **Citation accuracy must hold.** All questions' citation_accuracy | |
| must be ≥ 0.95 at the chosen threshold. One question at 0.80 is | |
| noise-tolerated; two or more is a hard stop. | |
| 4. **k8s_022 (flavor-B) retrieval must remain at R@5=1.0.** The | |
| gap is prompting-side, not retrieval-side; any threshold that | |
| breaks the already-working retrieval on flavor-B questions is | |
| a regression. | |
| 5. **Pick the highest threshold that satisfies 1–4.** Rationale: | |
| a higher threshold short-circuits more OOS queries at the tool | |
| level, saving a retrieval round trip and an LLM call; this is | |
| a real latency and token-cost win when the correctness is held. | |
| 6. **Tie-break.** If multiple values all satisfy 1–4, prefer the | |
| value closest to a clean round number (0.020 over 0.018) for | |
| documentation clarity. | |
| 7. **Floor.** If no threshold > 0.015 satisfies 1–4, keep 0.015. | |
| No threshold < 0.015 will be chosen regardless: sub-0.015 is | |
| strictly less protective than the pilot-floor. | |
| **Scope bound.** K8s only; FastAPI's `refusal_threshold: 0.02` is | |
| unchanged. The flavor-B response-style gap (parallel track #3) is | |
| NOT a sweep variable; changing the threshold does not fix LLM | |
| phrasing; that's the Fix 2 + prompt guidance stacked experiment | |
| the parallel-tracks list already defers. | |
| **Measured results.** All four runs use the post-metric-fix pipeline | |
| (grounded_refusal metric from `4454894`), deterministic mode, | |
| `gpt-4o-mini-2024-07-18`, same retriever config. | |
| | threshold | avg R@5 | OOS refusal | gate fired on | broken retrieval | | |
| |-----------|---------|-------------|-----------------------------------|------------------------| | |
| | 0.010 | 0.957 | 2/2 | none | none | | |
| | 0.015 | 0.957 | 2/2 | none | none | | |
| | 0.020 | 0.870 | 2/2 | k8s_006, k8s_007, k8s_024 | k8s_006, k8s_007 (R@5=0.00) | | |
| | 0.025 | 0.913 | 2/2 | k8s_004, k8s_007, k8s_024 | k8s_007 (R@5=0.00) | | |
| **Structural finding: LLM query variance makes max_scores non-deterministic.** | |
| At 0.020, `k8s_006` (ConfigMap, simple) gate-fired → empty retrieval → | |
| R@5=0.00. At 0.025, `k8s_006` did NOT gate-fire → 5 sources → R@5=1.00. | |
| A question that gate-fires at a lower threshold but not at a higher one | |
| is impossible if its max_score is fixed. The explanation is that the | |
| SearchTool receives different queries across runs: the orchestrator | |
| issues LLM-generated queries, and the same question can produce | |
| different top-k max_scores run-to-run. `k8s_006`'s max_score for the | |
| query the LLM chose lives somewhere around the 0.018–0.025 boundary; | |
| which side of any given threshold it lands on depends on which query | |
| the LLM wrote. | |
| This means **any threshold above 0.015 is structurally fragile**, not | |
| merely "failed on this run." Even if a run at 0.018 passed, a future | |
| run could gate-fire on `k8s_006` or `k8s_007` because the query is | |
| non-reproducible. The production threshold needs to sit below all | |
| legitimate simple-question max_scores with enough margin to absorb | |
| LLM query variance. | |
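The monotonicity argument can be checked mechanically against the "gate fired on" column of the results table. The gate shape used here (fire when `max_score < threshold`) is an assumption; the function asks whether any single fixed score could explain a question's observations across the four runs.

```python
from typing import Dict


def gate_fires(max_score: float, threshold: float) -> bool:
    # Assumed gate shape: refuse when the best retrieval score is under the threshold.
    return max_score < threshold


def consistent_with_fixed_score(fired_by_threshold: Dict[float, bool]) -> bool:
    """True if one fixed max_score could produce every observation.

    With a fixed score s the gate passes at t when s >= t and fires when s < t,
    so every passing threshold must sit below every firing threshold.
    """
    fired = [t for t, did_fire in fired_by_threshold.items() if did_fire]
    passed = [t for t, did_fire in fired_by_threshold.items() if not did_fire]
    return not fired or not passed or max(passed) < min(fired)


# Observations read off the "gate fired on" column of the results table.
K8S_006 = {0.010: False, 0.015: False, 0.020: True, 0.025: False}
K8S_007 = {0.010: False, 0.015: False, 0.020: True, 0.025: True}

print(consistent_with_fixed_score(K8S_006))  # False: the score moved between runs
print(consistent_with_fixed_score(K8S_007))  # True
```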
| **Decision: keep `refusal_threshold: 0.015`.** | |
| - `0.010`: meets all criteria, identical measured metrics to `0.015` | |
| (avg R@5=0.957, OOS refusal 2/2, no citation fails). Not chosen: | |
| lowering strictly weakens the gate's ability to catch | |
| low-confidence retrievals without improving any measured metric. | |
| - `0.015`: chosen. Meets all criteria and is the highest swept value | |
| that does not degrade retrieval, which is the working definition of | |
| the correct refusal-gate threshold. Preserving the gate's signal is | |
| the gate's purpose; `0.015` gives maximum gate strength without | |
| cost, `0.010` gives the same measurable behavior with less gate | |
| signal, so `0.015` dominates. | |
| - `0.020`: breaks TWO retrieval questions (`k8s_006`, `k8s_007`); | |
| disqualified per criterion 2. | |
| - `0.025`: breaks ONE retrieval question in this run (`k8s_007`) | |
| but the non-determinism finding means a future run could break | |
| more. Even ignoring non-determinism, still disqualified by the | |
| citation-accuracy-equivalent drop on `k8s_007`. | |
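The selection rule described in the bullets above can be written down explicitly. The sketch below is a paraphrase, not the project's sweep code: `SweepRow` and `pick_threshold` are hypothetical names, the rows are copied from the measured-results table, and the rule encoded is "among thresholds with no broken retrieval and full OOS refusal, keep the best avg R@5 and take the highest such threshold".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepRow:
    threshold: float
    avg_recall_at_5: float
    oos_refused: int
    oos_total: int
    broken_retrieval: tuple


# Values copied from the measured-results table.
SWEEP = [
    SweepRow(0.010, 0.957, 2, 2, ()),
    SweepRow(0.015, 0.957, 2, 2, ()),
    SweepRow(0.020, 0.870, 2, 2, ("k8s_006", "k8s_007")),
    SweepRow(0.025, 0.913, 2, 2, ("k8s_007",)),
]


def pick_threshold(rows):
    """Highest threshold that breaks no retrieval and loses no avg R@5."""
    eligible = [
        r for r in rows
        if not r.broken_retrieval and r.oos_refused == r.oos_total
    ]
    if not eligible:
        raise ValueError("no threshold in the sweep meets the criteria")
    best_recall = max(r.avg_recall_at_5 for r in eligible)
    return max(r.threshold for r in eligible if r.avg_recall_at_5 == best_recall)


chosen = pick_threshold(SWEEP)
```

Because the rule only ranks values that were actually swept, it says nothing about untested points such as 0.018; the non-determinism finding is what argues against probing upward from 0.015.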
| **Corpus characteristic finding.** The 0.020 default inherited from | |
| FastAPI breaks on K8s because K8s retrieval score distributions are | |
| lower for "easy" questions. `k8s_006` ("What is a ConfigMap?") and | |
| `k8s_007` ("What does a Kubernetes Job do?") are both `type: simple` | |
| with clean single-source expected answers, exactly the cases where | |
| BM25+embedding scores should be highest. They land at max_scores in | |
| the ~0.018 range, below the FastAPI-calibrated 0.020 default. This | |
| is **not an authoring bug**: both questions retrieve their | |
| `expected_sources` correctly when the gate doesn't fire. It's a | |
| corpus characteristic: K8s documentation has more topic-overlap | |
| across pages than FastAPI, diluting top-k concentration. | |
| The 25-question set exposed this because the 6-question pilot had | |
| no simple questions with low max_scores: the pilot was drawn from | |
| retrieval-stressful areas (comparison, multi-hop, flavor-B). The | |
| 25-question authoring deliberately added simple questions to hit | |
| the CRAG distribution target (6 simple, against a target of 5 to 6), | |
| and those simple questions revealed the corpus-characteristic floor. | |
| **Config change.** `configs/default.yaml` `corpora.k8s.refusal_threshold` | |
| comment updated to reference this sweep. Value unchanged at `0.015`. | |
| **Not in scope.** (a) Adding retry-with-query-variance to the | |
| SearchTool to reduce max_score variance: separate session, affects | |
| other corpora. (b) Tuning FastAPI's threshold against its golden | |
| set: the FastAPI default was empirically fine on its own 30Q set | |
| and is not a documented regression. (c) Fixing the `k8s_015` | |
| R@5=0.50 value observed across all threshold runs: pre-existing | |
| authoring state from `4454894`, tracked separately if it becomes | |
| a concern on future runs. | |
| **Narrative summary.** Session hypothesis: pilot_005 is a | |
| counterfactual-query-expansion problem. Session evidence: the | |
| hypothesis is correct on retrieval: the target chunk is reachable | |
| via negative-framing queries and Fix 2 surfaces it deterministically | |
| with zero iteration-budget impact. Session evidence also shows the | |
| hypothesis is **incomplete**: retrieval-only fixes cannot close | |
| the response-style gap, because the LLM under unaided prompting | |
| hedges when a documented negative is surrounded by unrelated | |
| topical content. A future session exploring **Fix 2 + targeted | |
| prompt guidance stacked** is the natural next experiment; this | |
| session's pilot-first discipline has been preserved against two | |
| distinct pre-committed gates, both firing for the reasons they | |
| were designed to catch. | |
| ## Credential-exposure incident and history rewrite (2026-04-14/15) | |
| **Summary.** During Week 1 work on the | |
| `feat/user-friendly-landing-page-live-dashboard` branch, an | |
| `instruction.txt` file containing plaintext OpenAI and Anthropic | |
| API keys was accidentally committed at pre-rewrite SHA `2b3150f` | |
| (`style: fix ruff lint - import sorting, line length`) and removed | |
| from the working tree in a later commit (pre-rewrite SHA `3a2c5ef`, | |
| `security: remove instruction.txt containing plaintext credentials`). | |
| The removal did not clean git history: the keys remained accessible | |
| via `git show 2b3150f:instruction.txt` in local history. | |
| **Discovery.** The issue was discovered when GitHub push protection | |
| rejected the first push of the branch to the `origin` remote, | |
| flagging the credentials via its secret-scanning system. The branch | |
| had never been pushed to any public remote prior to the rewrite; | |
| the detection fired on the very first push attempt, which is the | |
| correct moment for secret-scanning to act. Honest credit to the | |
| tooling: GitHub's push protection did exactly what it was designed | |
| to do, and the alternative failure mode (silent push of real | |
| credentials to a public repo) did not occur. | |
| **Immediate actions, in order.** | |
| 1. **Key rotation.** Rotated both OpenAI and Anthropic keys at the | |
| respective provider dashboards, revoking the exposed values | |
| immediately. Rotation was confirmed before any git operation | |
| ran. The reasoning was that the keys were exposed on the local | |
| disk regardless of whether they ever made it to a public remote, | |
| so the exposure window needed to be closed first. | |
| 2. **Unauthorized-use check.** Verified billing/usage dashboards on | |
| both OpenAI and Anthropic for the exposure window (from commit | |
| `2b3150f` landing until rotation). No unauthorized activity | |
| observed on either account. | |
| 3. **Local `.env` update and smoke test.** Updated local `.env` | |
| with the new keys. Verified both worked via minimal API calls | |
| that return only HTTP status codes (never the key values | |
| themselves): `GET /v1/models` for OpenAI (200), `POST /v1/messages` | |
| with a 1-token request for Anthropic (200). Total verification | |
| cost: <$0.0001. | |
| 4. **Repository backup.** Before running any history-rewriting | |
| command, backed up the entire repository via `rsync -a` to | |
| `/Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>`, | |
| excluding only `.mypy_cache` and `.cache` (both derivative, | |
| regenerable, and explicitly `.gitignore`'d). The backup preserved | |
| `.git/`, all four worktree state files under `.git/worktrees/`, | |
| the `.worktrees/` checkouts themselves, and all tracked source | |
| files. The backup is the safety net if the rewrite had gone | |
| wrong in any way; this session never needed to consult it. | |
| 5. **History rewrite via `git filter-repo`.** Ran | |
| `git filter-repo --path instruction.txt --invert-paths --force` | |
| on the main clone. The `--force` flag was required because | |
| filter-repo's default safety check refuses to run on non-fresh | |
| clones; the backup step above mitigates the risk that this flag | |
| is usually guarding against. 186 commits were parsed and | |
| rewritten in ~2.4 seconds; filter-repo's internal repacking | |
| completed in an additional ~5 seconds. The `origin` and `hf` | |
| remotes were automatically unset by filter-repo as its standard | |
| safety behavior (and restored from a saved file before the push). | |
| 6. **Dropped empty commit.** Pre-rewrite commit `3a2c5ef` (which | |
| removed `instruction.txt` from the working tree but did not | |
| clean history) became empty after filter-repo stripped the file | |
| from all prior commits and was dropped automatically. This is | |
| correct filter-repo behavior: the commit's only net effect was | |
| to remove a file that no longer exists in any predecessor, so | |
| post-rewrite it has no content change and is elided from the | |
| linear history. The total commit count went from 186 → 185. | |
| Pre-rewrite SHA `3a2c5ef` maps to `00000...00000` in | |
| `.git/filter-repo/commit-map`, indicating the drop. The dropped | |
| SHA was not referenced anywhere in DECISIONS.md, so the drop | |
| had zero audit-trail impact. | |
| 7. **Multi-layer verification sweep.** Ran six checks across every | |
| location where the credentials could still be present: | |
| (a) `git log --all --full-history -- instruction.txt` returned | |
| empty; (b) `git rev-list --all --objects | grep instruction.txt` | |
| returned 0 matches; (c) `git reflog --all` was empty after | |
| `git reflog expire --expire=now --all`; (d) `git fsck | |
| --unreachable` returned clean; (e) `git stash list` was empty; | |
| (f) a precise key-value regex scan across all blobs in the | |
| rewritten object database (`sk-[A-Za-z0-9]{30,}`, | |
| `sk-ant-[A-Za-z0-9]{20,}`, and env-var-assignment patterns) | |
| found 23 matches, **all verified to be non-secret content**. | |
| Specifically: 15 historical README.md blobs containing the | |
| documentation placeholder `ANTHROPIC_API_KEY=sk-ant-...` | |
| (with three literal dots), 7 historical `docs/provider_comparison.md` | |
| blobs with the same documentation placeholder pattern, and 1 | |
| `tests/test_output_validator.py` blob containing test fixtures | |
| that intentionally use mock key-shaped strings to verify the | |
| output-validator's secret-redaction logic. The precise scan is | |
| a meaningful check: it demonstrates that the exposure was | |
| isolated to `instruction.txt` and did not spread via copy-paste | |
| of the key values into other files before removal. | |
| 8. **Worktree walk.** All four worktrees (`feat-infra-sprint`, | |
| `feature-grounded-refusal`, `langchain-baseline`, | |
| `security-hardening`) were checked for `instruction.txt` history | |
| pollution and for uncommitted changes. All four were clean: | |
| no pollution in any branch's history (filter-repo operates on | |
| all refs in a shared `.git/`, so the worktrees were reached | |
| through the main clone's object database) and no local dirty | |
| state in any working tree. No worktree deletion or recreation | |
| was needed. | |
| 9. **DECISIONS.md SHA remap.** The filter-repo operation rewrote | |
| every commit's SHA downstream of the first rewritten commit. | |
| This broke every explicit SHA reference in DECISIONS.md because | |
| those references pointed to pre-rewrite SHAs that no longer | |
| exist. The remap used `.git/filter-repo/commit-map` as the | |
| authoritative SHA-based mapping (not message-based pairing, | |
| which would have been vulnerable to duplicate-message | |
| ambiguity; 2 pairs of commits in the pre-rewrite history did | |
| in fact have identical messages, though neither was in the | |
| substitution set). Four unique old SHAs were remapped across | |
| 18 substitution sites: | |
| | OLD (pre-rewrite) | NEW (post-rewrite) | Commit role | | |
| |---|---|---| | |
| | `bd2b913` | `213da36` | Fix 1 counterfactual prompt clause revert | | |
| | `b97f00f` | `125dac0` | K8s refusal_threshold 0.02 → 0.015 calibration | | |
| | `77017db` | `5c1f49f` | pin gpt-4o-mini snapshot + wire fastapi golden | | |
| | `526be18` | `4454894` | Week 1 step 5: 25Q golden + grounded_refusal fix | | |
| Every message matched exactly across the old→new pairing; no | |
| new SHA prefix collides with any old SHA prefix; post-remap | |
| grep confirmed zero remaining references to any old SHA. | |
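The commit-map-driven remap in step 9, and the dropped-commit check in step 6, can be sketched as follows. This is an illustration under assumptions: the map is taken to be a header line followed by one `old new` pair of full SHAs per line, a dropped commit is one whose new SHA is all zeros (as noted above), and `parse_commit_map`, `resolve`, and `remap_text` are hypothetical helpers. The SHAs in the usage lines are synthetic.

```python
import re
from typing import Optional

ZERO_SHA = "0" * 40


def parse_commit_map(text: str) -> dict:
    """Parse 'old new' SHA pairs, skipping the header and blank lines."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] == "old":
            continue
        mapping[parts[0]] = parts[1]
    return mapping


def resolve(prefix: str, mapping: dict) -> Optional[str]:
    """New full SHA for an abbreviated old SHA; None if the commit was dropped."""
    hits = [old for old in mapping if old.startswith(prefix)]
    if len(hits) != 1:
        raise LookupError(f"{prefix!r} matched {len(hits)} commits")
    new = mapping[hits[0]]
    return None if new == ZERO_SHA else new


def remap_text(doc: str, prefixes: list, mapping: dict) -> str:
    """Swap each old abbreviated SHA for its new one in a single pass."""
    table = {p: resolve(p, mapping) for p in prefixes}
    if not table:
        return doc
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")

    def swap(match):
        old = match.group(1)
        new = table[old]
        # Leave references to dropped commits untouched for manual review.
        return old if new is None else new[: len(old)]

    return pattern.sub(swap, doc)


# Synthetic SHAs, built at runtime for illustration only.
sample_map = "\n".join([
    "old" + " " * 38 + "new",
    "aaaaaaa" + "1" * 33 + " " + "bbbbbbb" + "2" * 33,
    "ccccccc" + "3" * 33 + " " + ZERO_SHA,
])
mapping = parse_commit_map(sample_map)
remapped = remap_text("see `aaaaaaa` and `ccccccc`", ["aaaaaaa", "ccccccc"], mapping)
```

Doing the substitution in one regex pass, rather than one `replace` per SHA, means a new prefix can never be rewritten a second time even if it happened to equal another old prefix.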
| **Exposure scope assessment.** The branch had never been pushed | |
| to any public remote prior to the rewrite. The credentials existed | |
| in: | |
| - Local git history at `/Users/zenith/Desktop/agent-bench/.git/` (cleaned) | |
| - Four worktree clones sharing the same `.git/` (cleaned via the main repo) | |
| - The rsync backup at | |
| `/Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>` | |
| (to be deleted after this commit and test suite confirm the | |
| rewrite is correct) | |
| No external exposure via GitHub, HF Spaces, or any other shared | |
| system occurred. No cached CI artifacts contain the keys because | |
| CI only runs on pushed branches and this branch was never pushed. | |
| No forks or clones exist outside the local machine. GitHub's | |
| push-protection detection itself touched the key strings during | |
| the rejected push attempt, but GitHub's secret scanning is trusted | |
| infrastructure and the rejection is the good outcome, not an | |
| additional exposure event. | |
| **Why this entry exists.** Credential hygiene failures are worth | |
| documenting, not hiding. A reviewer who reads this entry sees a | |
| developer who: made a mistake, caught it via automated tooling | |
| working as designed, rotated keys before touching git, rewrote | |
| history surgically with a backup as the safety net, verified the | |
| rewrite across six independent checks, and preserved audit-trail | |
| integrity through the SHA remap. The honest-evaluation brand | |
| extends to credential-handling incidents. The alternative of | |
| pretending this didn't happen, or silently unblocking the | |
| secret-scanning rejection to push exposed values to a public repo, would | |
| be a strictly worse outcome for both security posture and brand | |
| credibility. | |
| **Procedural lessons for DECISIONS.md going forward.** Prefer | |
| explicit commit SHAs over positional references like "this commit" | |
| or "commit above"; positional references do not survive history | |
| rewrites as robustly as explicit SHAs do. The "Fix 2 outcome" | |
| entry above was identified during this incident as missing an | |
| explicit SHA reference to the Fix 2 revert commit (post-rewrite | |
| SHA `27c2e17`); this is tracked as parallel-tracks item #7 for a | |
| retroactive fix in the next docs pass. | |
| ### Round 2: Google API key format in a test fixture | |
| After the round-1 rewrite was complete and the feature branch had | |
| been pushed to `origin` for the first time, GitHub secret scanning | |
| raised a second alert (alert #1, `secret_type: google_api_key`) | |
| against `tests/test_output_validator.py` line 152 at pre-round-2 | |
| commit `8ebe3964af7d` (`security: fail-closed on secret extraction | |
| and env var leakage`). The alert was on a test fixture inside a | |
| `@pytest.mark.parametrize` list, structurally consistent with the | |
| other fake fixtures in the same list (OpenAI `sk-test123`, | |
| Anthropic `sk-ant-xyz`, AWS `AKIAIOSFODNN7EXAMPLE`). The Google | |
| fixture, however, was 35 chars after the `AIza` prefix and matched | |
| both GitHub's detection pattern and the output validator's own | |
| detection regex exactly. | |
| **Disambiguation.** Asked whether the string was a hand-typed fake | |
| or a real-leaked Google API key, the developer confirmed: (1) yes, | |
| a Google API key had been created at some point in a GCP or | |
| Google AI Studio context unrelated to this project, and (2) no, | |
| the string on line 152 was not recognizably hand-typed. Combined | |
| with the structural inconsistency against the other clearly-fake | |
| fixtures in the same parametrize list, the safe interpretation | |
| was to treat it as potentially real and rotate + rewrite rather | |
| than dismiss as false positive. | |
| **Actions, in order.** | |
| 1. **Google API key rotation.** All Google API keys on the | |
| developer's GCP and Google AI Studio accounts rotated at the | |
| provider dashboards, regardless of which specific key matched | |
| line 152, because the specific match was not known with | |
| certainty. Rotation confirmed before any git operation. | |
| 2. **Billing/activity check.** Verified Google Cloud billing and | |
| API activity on every project for the window since commit | |
| `8ebe3964af7d` landed (2026-04-12 18:18). No unauthorized | |
| activity observed. | |
| 3. **Why the validator regex and GitHub's detector are identical.** | |
| The output validator's regex at `agent_bench/security/output_validator.py` | |
| line 23 is `\bAIza[0-9A-Za-z_\-]{35}\b`, byte-for-byte identical | |
| to GitHub's secret-scanning Google API Key detection pattern. | |
| This means there is no static test fixture that satisfies the | |
| validator's test assertion (the validator must block the input) | |
| without also triggering GitHub's push protection. Any replacement | |
| with a fixture that matches the validator's regex is immediately | |
| re-flagged; any replacement with a fixture that does not match | |
| the validator's regex breaks the test assertion. The cleanest | |
| resolution is to remove the Google fixture from the static | |
| parametrize list entirely and restore Google API key format | |
| coverage via a runtime-generated fixture that constructs a | |
| 35-char `AIza`-prefixed string at test time and never lands as | |
| a literal in source code. Tracked as a parallel-tracks item. | |
| The output validator's regex is NOT weakened; the test loses | |
| one of seven parametrize cases but continues to verify OpenAI, | |
| Anthropic, AWS, JWT, and env-var-assignment detection. | |
| 4. **Round-2 filter-repo.** Ran | |
| `git filter-repo --replace-text <file> --force` with the pattern | |
| file containing `regex:AIza[A-Za-z0-9_\-]{35}==>AIzaFIXTUREREDACTED`. | |
| This replaced the Google API key format anywhere it appeared | |
| in any historical blob across the entire repository. Every | |
| commit from `8ebe3964af7d` forward was rewritten, which | |
| cascaded through the full post-round-1 history including all | |
| round-1-remapped SHAs and tonight's 5 commits. Total commits | |
| processed: 186. filter-repo's internal commit-map wrote 152 | |
| changed entries and 35 unchanged entries (commits before | |
| `8ebe3964af7d` that never touched the pattern). | |
| 5. **Working-tree fixture removal.** After the filter-repo rewrite, | |
| `tests/test_output_validator.py` line 152 read | |
| `"google says AIzaFIXTUREREDACTED"` (15 chars after `AIza`, | |
| below the validator's 35-char regex threshold). Removed the | |
| line entirely from the parametrize list and added a block | |
| comment explaining the removal, the regex-collision reason, | |
| the parallel-tracks item to restore via runtime-generated | |
| fixture, and an explicit note that the validator's regex | |
| remains unchanged. Committed as a separate new commit on top | |
| of the rewritten history. | |
| 6. **Round-2 verification sweep.** Re-ran the same six-check | |
| sweep: `git log`, `git rev-list --all --objects`, reflog, | |
| fsck, stash, and a precise regex scan across all blobs for | |
| the `\bAIza[0-9A-Za-z_\-]{35}\b` pattern. **Zero blobs** in | |
| the post-round-2 object database contain a 35-char `AIza` | |
| pattern. The scrub is complete across all history. | |
| 7. **Round-2 DECISIONS.md SHA remap.** The round-1 remap table | |
| above uses SHAs `213da36`, `125dac0`, `5c1f49f`, `4454894` | |
| as the "NEW (post-rewrite)" column. These are the | |
| **post-round-2** SHAs; they were `e6d9675`, `c1d8163`, | |
| `740c9d5`, `6d177ba` after round 1 and got rewritten again by | |
| round 2. To avoid a three-column mapping table showing | |
| intermediate round-1 SHAs, the table above reads as a direct | |
| pre-rewrite → current-state mapping. The round-1-only | |
| intermediate SHAs are preserved in this narrative as | |
| "round-1 SHAs" for audit completeness but are not the | |
| canonical SHAs anyone looking up a commit should use. The | |
| canonical SHAs are the post-round-2 values. | |
| **Additional round-2 SHA update:** parallel-tracks item #7 | |
| (Fix 2 revert commit SHA missing from the Fix 2 outcome entry) | |
| was updated from `8c836f5` (post-round-1) to `27c2e17` | |
| (post-round-2). | |
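The blob-level regex check in step 6 can be sketched as below. It is a simplified stand-in: the real sweep reads blobs from the git object database, while this version takes an already-decoded mapping of blob id to text so it stays self-contained. `scan_blobs` and the blob ids are hypothetical, and the key-shaped string is built at runtime so no literal lands in source.

```python
import re

# Same pattern the entry quotes for the output validator.
GOOGLE_KEY = re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b")


def scan_blobs(blobs: dict) -> list:
    """Return the ids of blobs whose text contains a key-shaped match."""
    return sorted(bid for bid, text in blobs.items() if GOOGLE_KEY.search(text))


# Built at runtime; never a literal in the file.
key_shaped = "AIza" + "x" * 35

blobs = {
    "blob_redacted": "google says AIzaFIXTUREREDACTED",
    "blob_dirty": "google says " + key_shaped,
}
flagged = scan_blobs(blobs)
```

The redacted placeholder has only 15 characters after `AIza`, so it falls outside the `{35}` quantifier and is not flagged, which matches the behavior described in step 5.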
| **Exposure scope, round 2.** The branch had been pushed to origin | |
| exactly once before round-2 was discovered (the first push at the | |
| end of round 1, which landed commit `3167b59` at origin). The | |
| feature branch was the only affected ref: `main` was not updated, | |
| and no PR had been merged. The round-2 cleanup requires a | |
| force-push with `--force-with-lease` to overwrite the pushed | |
| round-1 history with the round-2 history. Force-push is normally a | |
| discipline concern, but here it is safe: the branch was published | |
| less than one hour before round-2 was discovered, no other work | |
| was based on the pushed round-1 history, and the force-push is | |
| scoped to this specific branch (not `main` or any long-lived ref). | |
| **Alert dismissal.** GitHub alert #1 was dismissed as | |
| `false_positive` via `gh api` after the force-push, with the | |
| resolution comment noting that the pre-round-2 commit SHA the | |
| alert referenced (`8ebe3964af7d`) no longer exists in the | |
| rewritten history and the test fixture has been removed from | |
| `tests/test_output_validator.py` pending a runtime-generated | |
| replacement. | |
| **Round-2 procedural lesson.** The validator-regex/detector-regex | |
| identity is a structural finding worth noting for future security | |
| test design. Any test fixture that verifies detection of a | |
| specific secret format will, by construction, match the format | |
| it is testing. If the format is one GitHub (or any upstream | |
| detector) also scans for, the fixture will trigger an alert on | |
| every push where it is introduced. The three durable mitigations | |
| are: (a) generate fixtures at runtime so they never land in source, | |
| (b) use an isolated regex that is a proper subset of the production | |
| detector's regex so fixtures fall below the detector's match | |
| threshold, or (c) exclude the file explicitly via `paths-ignore` in | |
| `.github/secret_scanning.yml`. This project is adopting | |
| option (a) as the follow-up, because it preserves the production | |
| detector regex without weakening and keeps the test's fidelity to | |
| the actual attack surface. | |
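Option (a) might look like the sketch below. It is not the project's test: `make_google_key_fixture` is a hypothetical helper and `VALIDATOR_PATTERN` copies the regex quoted earlier in this entry rather than importing the real validator. One detail worth noting is that the pattern ends in `\b`, so a generated string whose last character is `-` would not match at end of input; the sketch restricts the alphabet to letters and digits to avoid that.

```python
import re
import secrets
import string

# Copied from the pattern quoted in this entry; the real test would import it.
VALIDATOR_PATTERN = re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b")


def make_google_key_fixture() -> str:
    """Build a key-shaped string at test time so no literal lands in source."""
    # Letters and digits only: a trailing '-' would defeat the closing \b.
    alphabet = string.ascii_letters + string.digits
    body = "".join(secrets.choice(alphabet) for _ in range(35))
    return "AIza" + body


def test_validator_blocks_google_key_format():
    text = "google says " + make_google_key_fixture()
    assert VALIDATOR_PATTERN.search(text) is not None


fixture = make_google_key_fixture()
test_validator_blocks_google_key_format()
```

The generated value is random on every run, so it cannot correspond to a stored credential and never appears in a commit for an upstream scanner to flag.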
| ## Audit-path bug: streams masked a request-crashing failure (2026-04-15) | |
| During v1 deploy smoke testing, every non-stream `POST /ask` request | |
| and every injection-blocked request on the HF Space returned | |
| `{"detail": "Internal server error"}` (HTTP 500) instead of the | |
| intended 200 / 403. Normal queries via `POST /ask/stream` appeared to | |
| work correctly from the dashboard, so the bug was invisible until a | |
| direct curl hit a non-stream endpoint. Discovered and fixed same-day. | |
| **Root cause.** `agent_bench/security/audit_logger.py:60` called | |
| `self.path.parent.mkdir(parents=True, exist_ok=True)` the first time | |
| a request wrote an audit record. The default path `logs/audit.jsonl` | |
| resolved to `/home/user/app/logs/audit.jsonl` at runtime. The | |
| Dockerfile's `WORKDIR /home/user/app` creates that directory as | |
| **root:root mode 0755** because Docker's WORKDIR directive does not | |
| honor `--chown`. Subsequent `COPY --chown=user` lines only change | |
| ownership of the copied files, not the directory itself. At runtime | |
| under `USER user` (uid 1000), the process had `r-x` on WORKDIR: it | |
| could read and execute, but not create new subdirectories. First | |
| audit write → `PermissionError: [Errno 13] Permission denied: 'logs'`. | |
| **Why `.cache/` worked and `logs/` didn't.** `.cache/` is created at | |
| build time by the two `RUN python scripts/ingest.py ...` steps | |
| (running as root) and then explicitly chowned by | |
| `RUN chown -R user:user .cache/`. `logs/` was created lazily at | |
| runtime by the audit logger, as a non-root user, in a directory owned | |
| by root. Directory-creation permission wall. | |
| **Why the streaming endpoint masked it.** `/ask/stream` calls | |
| `_write_audit` at the *end* of the event generator | |
| (`routes.py:438`), after all stage events and the final answer have | |
| already been yielded over SSE. A failure there is already too late | |
| to affect the client's view of the response: the client sees the | |
| answer, the stream ends, and the audit entry is silently missing. | |
| Non-stream `/ask` calls `_write_audit` synchronously before `return` | |
| (line 263), so the failure propagates up through the middleware | |
| exception handler and becomes a visible 500. Injection-blocked | |
| requests on both endpoints also audit synchronously before returning | |
| (lines 193, 302) and produce the same visible 500. Normal dashboard | |
| use hits `/ask/stream` → symptom invisible; smoke testing | |
| non-stream `/ask` → symptom obvious. | |
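The masking effect can be reproduced without any web framework. The sketch below uses a plain Python generator as a stand-in for the SSE event generator; `write_audit`, `ask_sync`, and `ask_stream` are simplified hypothetical versions of the route handlers, not the code in `routes.py`.

```python
def write_audit():
    # Stand-in for the failing audit write on the read-only WORKDIR.
    raise PermissionError(13, "Permission denied", "logs")


def ask_sync():
    answer = "final answer"
    write_audit()  # raises before anything is returned to the caller
    return answer


def ask_stream():
    yield "stage: retrieve"
    yield "final answer"
    write_audit()  # raises only after the client already has the answer


# Streaming client: collects every chunk, then the generator fails.
received = []
stream_error = None
try:
    for chunk in ask_stream():
        received.append(chunk)
except PermissionError as exc:
    stream_error = exc

# Non-streaming client: gets nothing but the error.
try:
    sync_result = ask_sync()
except PermissionError:
    sync_result = "HTTP 500"
```

In the streaming case the full answer is in `received` before the exception exists, so a dashboard that only renders chunks has no reason to show a failure.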
| **Traceback confirmed from the HF Space runtime log:** | |
| ``` | |
|   File "/home/user/app/agent_bench/serving/routes.py", line 302, in ask_stream | |
|     _write_audit( | |
|   File "/home/user/app/agent_bench/serving/routes.py", line 558, in _write_audit | |
|     audit_logger.log(record) | |
|   File "/home/user/app/agent_bench/security/audit_logger.py", line 60, in log | |
|     self.path.parent.mkdir(parents=True, exist_ok=True) | |
| PermissionError: [Errno 13] Permission denied: 'logs' | |
| ``` | |
| **Two-fix patch.** Both applied in the same deploy cycle: | |
| 1. **Dockerfile (proximate).** Commit `ca34ccb`: | |
| `RUN mkdir -p logs && chown -R user:user .cache/ logs/` at build | |
| time, before `USER user`. Ensures the runtime user owns the | |
| audit-log directory from container start. | |
| 2. **`AuditLogger.log()` (architectural, defense-in-depth).** Commit | |
| `25e0f1b`: wrap the entire write body in `try/except Exception`, | |
| log the failure via structlog as `audit_write_failed` with the | |
| exception type and path, return normally. | |
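Fix 2 could be shaped roughly like this. It is a sketch, not the code in commit `25e0f1b`: the class body is reconstructed from the description above, and the standard-library `logging` module stands in for structlog so the example runs without extra dependencies. The failure is provoked by placing the log directory under a regular file, which fails for any user, including root.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("audit")  # stand-in for the project's structlog logger


class AuditLogger:
    def __init__(self, path="logs/audit.jsonl"):
        self.path = Path(path)

    def log(self, record: dict) -> None:
        """Append one JSON line; never let an audit failure reach the request."""
        try:
            self.path.parent.mkdir(parents=True, exist_ok=True)
            with self.path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")
        except Exception as exc:
            logger.warning(
                "audit_write_failed error_type=%s path=%s",
                type(exc).__name__,
                self.path,
            )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Happy path: directory is created and the record is written.
    ok_path = root / "logs" / "audit.jsonl"
    AuditLogger(ok_path).log({"event": "ask"})
    written = ok_path.read_text(encoding="utf-8")

    # Failure path: the parent of logs/ is a regular file, so mkdir fails.
    blocker = root / "blocker"
    blocker.write_text("not a directory", encoding="utf-8")
    failed_result = AuditLogger(blocker / "logs" / "audit.jsonl").log({"event": "ask"})
```

The failure-path call returns normally and leaves only a warning in the log, which is the behavior the regression test mentioned later in this entry would assert.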
| **Why both, not either alone.** | |
| - **Only fix 1:** unblocks the Space but leaves the architectural | |
| bug: an audit logger that can crash the app is misdesigned | |
| regardless of filesystem permissions. The SECURITY.md OWASP LLM10 | |
| writeup would be structurally weak. | |
| - **Only fix 2:** stops the crashes, but audit writes would still | |
| silently fail on HF Spaces because `logs/` remains unwritable. The | |
| OWASP LLM10 claim would be *live-untrue* on the demo surface. | |
| - **Both:** demo is honest (audit log actually written on HF Space), | |
| architecture is sound (audit failures degrade gracefully), and the | |
| OWASP LLM10 writeup has live evidence backing both claims. | |
| **Verification on the live Space after the patch.** `GET /health` | |
| flipped from `"degraded"` to `"healthy"` with `provider_available: | |
| true`; `POST /ask` normal returned 200 with grounded answers citing | |
| real sources from both corpora (was 500); `POST /ask` with the | |
| `ignore_previous` injection pattern returned 403 | |
| `"Request blocked: potential prompt injection detected"` (was 500); | |
| the runtime log shows `injection_detected pattern=ignore_previous | |
| tier=heuristic` followed by a 403 status and no `audit_write_failed` | |
| entries, confirming the audit write succeeded end-to-end. | |
| **Methodology lesson: streaming endpoints can hide backend failures.** | |
| Any request-path side effect (audit logging, metrics, cache write, | |
| persistence) that runs at the end of a streaming generator will fail | |
| silently from the client's perspective, because content is already | |
| committed to the wire before the failure fires. Testing such side | |
| effects against their failure modes *through* a streaming happy path | |
| is not sufficient, because the happy path masks the failure. Every | |
| request-path side effect needs at least one test that verifies it | |
| fails loudly (or, where the design says graceful, degrades loudly) | |
| when its underlying dependency is broken, **exercised synchronously, | |
| not via the streaming code path.** | |
| For this project, `AuditLogger.log()` should gain a unit test that | |
| asserts it returns normally (not raises) when its target directory is | |
| unwritable, i.e., a regression test for the `try/except` fix above. | |
| And the dependency injection of the audit logger in the route | |
| handlers should get an integration test that exercises both the | |
| streaming and non-streaming paths against the same failure injection, | |
| to assert the streaming path does not hide what the non-streaming | |
| path surfaces. Tracked as a v1.1 test-coverage follow-up. | |
| **Why this wasn't caught locally.** Existing | |
| `tests/test_audit_logger.py` tests use pytest's `tmp_path` fixture, | |
| which is always writable by the test process. The failure mode only | |
| manifests when the runtime user cannot create the log directory, a | |
| deployment-environment condition not reproducible in a unit test | |
| against a local filesystem owned by the developer. An integration | |
| test that runs the Docker image end-to-end and hits `/ask` against | |
| the actual container filesystem would have caught it pre-deploy. | |
| Such a test is out of scope for v1 (adds ~5 min to CI plus Docker | |
| build infrastructure) but is the right long-term mitigation for this | |
| class of bug. | |
| ## LLM-judge layer supersession: discrete-anchored 2-judge jury replaces continuous-score single-call | |
| The continuous-score single-call judges in `agent_bench/evaluation/metrics.py` | |
| (`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted | |
| and replaced by the per-dimension Judge layer at | |
| `agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle. | |
| **Design doc:** `docs/plans/2026-05-04-judge-layer-v1-design.md`. | |
| **Why this is a supersession, not a refactor.** The new layer differs from | |
| the old on six axes: discrete-anchored scale (vs continuous 0-1), | |
| reasoning-before-score JSON ordering (vs score-first), per-dimension | |
| judges (vs combined faithfulness/correctness), full provenance per call | |
| (judge_id + rubric_version + system_output_hash + prompt_seed; old had | |
| none), composable variance wrappers (rubric_permute, jury; old was | |
| single-call), and an intentional abstain-vs-raise discipline (vs silent | |
| `None` from a bare `except Exception`). | |
| **Evidence backing the supersession claim.** The calibration κ table | |
| quantifies the new layer's agreement with hand-labels across 6 ablation | |
| rows (baseline + 3 variance ablations + permute + 2-judge jury). The | |
| files defending this entry's claim, by file path: | |
| - `measurements/2026-05-04-judge-calibration-labels.jsonl`: 30 items × 3 | |
| dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the | |
| literature ceiling). Lands in Phase 10. | |
| - `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`, | |
| `_baseline_no_anchors.json`, `_baseline_no_abstain.json`, | |
| `_permute.json`, `_jury_kappa_weighted.json`: per-row predictions. | |
| Land in Phase 11. | |
| - `docs/_generated/kappa_table.md`: generated κ ablation table | |
| copy-pasted into the writeup. Lands in Phase 11. | |
| - `docs/judge-design.md`: interpretive writeup with the closing | |
| "when NOT to use LLM-judge" position. Lands in Phase 12. | |
| **Config-knob preservation.** `evaluation.judge_provider` is unchanged | |
| across all 5 YAML configs; new `evaluation.judge_dimensions` field | |
| defaults to the three v1 dimensions. Zero user-facing config migration. | |
| **Out of scope (v1.1+).** Mistral self-hosted as the third jury member, | |
| Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2 | |
| prompt optimization, citation_faithfulness in the default | |
| judge_dimensions, AC2 sympy-derived parity tests. | |
| ## Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran (2026-05-05) | |
| The Opus stress-test pass over the 30 calibration items × 3 dimensions | |
| disagreed with the single-rater human gold on **22 of 30 groundedness | |
| items** (8/30 agreement). Relevance and completeness agreed at 28/30 and | |
| 25/30 respectively. The groundedness disagreement is consistent in | |
| direction (every disagreed-on item is `human=1, opus=0`) and has a | |
| single root cause. | |
| **Root cause: reference-scope drift between rubric author and labeler.** | |
| `agent_bench/evaluation/rubrics/groundedness.md` defines the reference | |
| scope as the gold snippets attached to each item: | |
| > The judge sees only the gold snippets, not the retrieved chunks. A | |
| > claim that happens to be true in the world but is not entailed by the | |
| > snippets fails groundedness. | |
| The single-rater notes on the disagreed-on items describe checking | |
| against the broader documentation, not against `source_snippets`: | |
| "supported by the corpus", "supported by the docs", "supported by the | |
| provided dependency snippet". For items like `k8s_006` the gold snippet | |
| is one sentence ("A ConfigMap is an API object used to store | |
| non-confidential data in key-value pairs"), while the agent's answer | |
| correctly synthesizes seven or eight additional claims from the full | |
| `k8s_configmap.md`. Those claims are true in the world and well-supported | |
| by the full doc, but **not entailed by the one snippet**. Opus applied | |
| the strict-snippet rubric; the human rater applied a corpus-supported | |
| rubric. | |
| **Why this blocks `make calibrate` against the current gold.** The κ | |
| ablation compares Haiku and GPT-4o-mini judges against the human gold. | |
| A judge that correctly applies the strict-snippet rubric will disagree | |
| with miscalibrated gold; a judge that's too lenient will agree. The | |
| ablation rewards leniency and punishes rigor, the opposite of the | |
| intended measurement. This is the same failure mode codified earlier in | |
| this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune | |
| compensation when the measurement is wrong, not the intended effect. | |
| **Why the rubric stays as written, not relaxed to "corpus-supported".** | |
| Strict-snippet groundedness measures *RAG behavior*: did the agent | |
| synthesize from what it retrieved? Corpus-supported groundedness | |
| measures *LLM general knowledge passing through a RAG harness*: did the | |
| agent happen to be correct? The first is what this benchmark is for; | |
| the second is what `agent_bench/evaluation/metrics.py` measured before | |
| supersession. Relaxing the rubric to "corpus-supported" would silently | |
| re-introduce the failure mode the supersession entry above just removed. | |
| **Decision: a three-step correction lands before `make calibrate` runs:** | |
| 1. **Rubric clarification commit on `agent_bench/evaluation/rubrics/groundedness.md`.** | |
| Add an explicit reference-scope line and one anchored example | |
| contrasting "supported by the snippet" vs "true in the world but | |
| not in the snippet". Audit-trail requirement: the v1.1 writeup will | |
| cite "rubric clarified between v1.0 and v1.1", and the git history | |
| needs to back that claim. | |
| 2. **Re-label the 22 disagreed-on groundedness items** in | |
| `measurements/2026-05-04-judge-calibration-labels.jsonl` against the | |
| clarified rubric, snippet-only. **Do not mechanically copy Opus's | |
| labels.** The labels remain the human single-rater's; what changes is | |
| the rubric being applied. Mechanical copy would turn the κ table | |
| into "judge vs Opus", which is not what the writeup claims it | |
| measures. | |
| 3. **Recompute `make calibrate` against the corrected gold** and emit | |
| `docs/_generated/kappa_table.md` from the v1.1 labels. | |
| **Evidence files for the v1.1 writeup section:** | |
| - `measurements/2026-05-05-judge-rubric-opus-stress.jsonl`: 90 Opus | |
| labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min | |
| org rate limit, ~$0.20, ~14 min wall, zero infra-abstains). | |
| - `measurements/2026-05-04-judge-calibration-labels.jsonl`: original | |
| v1.0 single-rater gold; will be diffed against v1.1 corrected gold to | |
| quantify the re-label delta. | |
| - `agent_bench/evaluation/rubrics/groundedness.md`: pre/post diff is | |
| the rubric clarification. | |
| **Pre-labeling observations also worth recording for the writeup | |
| methodology section:** | |
| - `q021` (fastapi · calculation) answered the CORS preflight question | |
| correctly (600 / 60 = 10 minutes) with `sources: []` and | |
| `ranked_sources: []`: the agent did the arithmetic without retrieval | |
| and emitted an answer consistent with the snippet without having | |
| retrieved it. Methodologically interesting for the | |
| citation-faithfulness story (Block 2.7) if it ships: an answer can be | |
| correct without being grounded-by-citation. | |
| - `q025` (fastapi · multi_hop) answer was truncated mid-token by the | |
| orchestrator's max_tokens limit. The labels reflect what the system | |
| produced, not a mentally-patched complete version. The completeness | |
| rubric does not currently anchor "truncated response" as a level; | |
| v1.1 rubric work should add an anchor. | |
| - Several K8s items embed external knowledge that's correct but not in | |
| the snippet phrasing (`k8s_017` mentions exit-code-0 for init-container | |
| success; `k8s_009` describes Roles vs ClusterRoles by their semantics). | |
| The clarified groundedness rubric should pick **strict** on this case | |
| (claim must be supportable by the retrieved spans, not just consistent | |
| with them) and the anchored example should show that ruling. | |
| **Methodology framing for the writeup.** The Opus stress-test was added | |
| specifically to catch hand-labeled-gold fragility before the κ table is | |
| published. It caught it. The writeup's calibration section should | |
| disclose the rubric clarification, quantify the re-label delta on | |
| groundedness, and report κ against the v1.1 corrected gold. That is a | |
| more credible story than a first-try clean κ table would have been. | |
| **Outcome: 2026-05-05 calibrate run on v1.1 gold.** All 6 ablation rows | |
| ran cleanly after three coupled production-code fixes that landed on the | |
| same branch as the rubric clarification: (1) markdown fence stripping in | |
| `agent_bench/evaluation/judges/base.py::_strip_markdown_fence` because | |
| Haiku 4.5 wraps JSON output in ` ```json ... ``` `, (2) `max_tokens` | |
| 512 → 1024 because v1.1 anchored examples elicit longer model reasoning, | |
| (3) calibration runner v1.0 omitted `item_id` from prediction records; | |
| fixed in v1.1 with backfill of the 6 already-written row files via | |
| `hash → item_id` map (no re-spend). Probe-one-cell-before-sweep saved a | |
| fourth $0.50 wasted run after the fence-strip change; the methodology | |
| note in `feedback_judge_probe_before_sweep.md` was earned by this | |
| session's two failed full-row attempts that paid ~$1.15 for unparseable | |
| output before the diagnosis converged. | |
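A minimal sketch of the fence-stripping fix, assuming the judge returns either bare JSON or JSON wrapped in one Markdown fence. `strip_markdown_fence` is a hypothetical stand-in written for illustration; the real `_strip_markdown_fence` in `base.py` may handle more cases.

```python
import json
import re

# Hypothetical stand-in for `_strip_markdown_fence`; not the project's code.
# `{3} matches the three backticks that open or close a Markdown fence.
_FENCE_RE = re.compile(r"^\s*`{3}[\w-]*[ \t]*\n(.*?)\n?\s*`{3}\s*$", re.DOTALL)


def strip_markdown_fence(text: str) -> str:
    """Return the body of a single wrapping fence, or the text unchanged."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown
raw = f'{FENCE}json\n{{"score": 2, "reasoning": "covers both points"}}\n{FENCE}'

# Without stripping, json.loads(raw) raises; with it, the payload parses.
parsed = json.loads(strip_markdown_fence(raw))
print(parsed["score"])
```

Probing a single cell with a check like this before launching a full row is the cheap way to confirm the parser accepts the provider's output shape.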
| The κ table at `docs/_generated/kappa_table.md` (regenerated on | |
| 2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for | |
| completeness; see report.py `_DIM_METRIC`) shows three findings | |
| that the writeup interprets rather than reports verbatim: | |
| **v1.1 finding 1: relevance is not "judges fail" territory.** | |
| Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the | |
| 29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00. | |
| AC1 is the load-bearing statistic on relevance and groundedness; both | |
| metrics agree on completeness where the gold (23×2 / 5×1) is less skewed. | |
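The degeneracy can be reproduced on toy labels. This is a sketch, not the project's `report.py`: the gold matches the 29×2 + 1×1 relevance split above, while the all-`2` judge and the three-category scale `(0, 1, 2)` are assumptions chosen for illustration.

```python
from collections import Counter


def observed_agreement(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Cohen's kappa: chance term is the product of the two marginals."""
    n = len(gold)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_o = observed_agreement(gold, pred)
    p_e = sum(gold_counts[k] * pred_counts[k] for k in gold_counts) / (n * n)
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def gwet_ac1(gold, pred, categories=(0, 1, 2)):
    """Gwet's AC1: chance term uses the averaged marginals pi_k."""
    n = len(gold)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_o = observed_agreement(gold, pred)
    pi = [(gold_counts[k] + pred_counts[k]) / (2 * n) for k in categories]
    p_e = sum(v * (1 - v) for v in pi) / (len(categories) - 1)
    return (p_o - p_e) / (1.0 - p_e)


gold = [2] * 29 + [1]   # relevance gold from the text
pred = [2] * 30         # hypothetical judge that always scores 2

raw = observed_agreement(gold, pred)
kappa = cohen_kappa(gold, pred)
ac1 = gwet_ac1(gold, pred)
print(f"raw={raw:.3f} kappa={kappa:.3f} ac1={ac1:.3f}")
```

With 29/30 raw agreement, κ collapses to 0 because the chance term equals the observed agreement, while AC1 stays near the raw figure.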
| **v1.1 finding 2: `no_cot completeness` agreement is real, not | |
| selective abstain.** AC1 = κ = 1.000 at n=24. The 2 absent cells | |
| (`q021`, `k8s_012`) are infrastructure abstains (provider rate-limit | |
| retry exhaustion), both gold=`2`, neither in baseline's disagreement | |
| set. On the 24 scored cells, all 4 baseline-with-CoT disagreements | |
| (3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to | |
| agreement when CoT is removed. The interview-relevant claim is the | |
| *opposite* of the conventional CoT-helps story: CoT-before-score on | |
| 3-point completeness lets the judge over-emphasize partial coverage | |
| and rationalize `1` when the human gold sides with the holistic | |
| "covers the points" reading. | |
| **v1.1 finding 3: `jury_kappa_weighted` underperformed baseline on | |
| completeness, with a precise mechanism.** Per-member analysis from | |
| `results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`: | |
| Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%; | |
| gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw | |
| 26.9%, systematically harsh on the 3-point scale, almost never | |
| scoring `2`. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9%: the | |
| raw agreement matches gpt-4o-mini alone exactly because the jury verdict | |
| reduces to gpt-4o-mini's verdict on every disputed cell. | |
| The mechanism is *missing-weight + round-down* compounding, not | |
| weighted voting in the usual sense. `scripts/run_calibration.py | |
| ::_load_weights_from_baseline` is a documented v1 stub that returns | |
| weight = 1.0 for every judge_id present in baseline. baseline.json | |
| contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini | |
| gets 1.0 from `jury.py`'s missing-key fallback (with a logged | |
| `jury_missing_weight_fallback_to_one` warning per call). Equal | |
| weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean | |
| of 1.5; the `_discretize_mean` rule is `frac > 0.5 → ceil else floor`, | |
| and `0.5 > 0.5` is false, so 1.5 floors to 1. gpt-4o-mini's verdict | |
| wins every disputed cell. The v1 design doc's risks subsection listed | |
| "jury κ worse than the better individual judge → (a) kappa-weighting | |
| wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired | |
| *both* branches simultaneously: branch (a) because the weighting is a | |
| stub returning equal weights, and branch (b) because round-down at | |
| exact 0.5 ties hands the verdict to the lower-scoring member. | |
| The deeper structural point is that weighting alone cannot rescue a | |
| systematically miscalibrated member. Even held-out validation that | |
| correctly assigned gpt-4o-mini's true low weight on completeness | |
| would still let it dominate disputed ties unless its weight were | |
| driven near zero, and at that point exclusion is more honest than | |
| near-zero inclusion. The conservative-on-binary "ties to lower" rule | |
| also doesn't transfer cleanly to ordinal scales: on completeness, | |
| "conservative" means scoring *toward incomplete*, which is precisely | |
| the direction of gpt-4o-mini's bias. | |
| **v1.2 fix list (four items, expanding the earlier two-item list):** | |
| 1. **Held-out jury weights.** Replace the | |
| `_load_weights_from_baseline` stub with a real κ-derived | |
| computation, evaluated on a *held-out validation set*, not the | |
| same calibration row whose κ is being measured against the gold. | |
| Closes the circular-weighting hole. | |
| 2. **Symmetric member coverage in the weights source.** Missing-member | |
| fallback to weight = 1.0 amplifies an unweighted member rather than | |
| suppressing it. Either every jury member must have a weight in the | |
| source file or the run must abort. The | |
| `jury_missing_weight_fallback_to_one` warning fired loudly on every | |
| call this run; in v1.2 it should be a hard error. | |
| 3. **Per-dimension member exclusion when individual κ falls below a | |
| threshold.** gpt-4o-mini at κ = 0.020 on completeness should not be | |
| in the completeness jury at all. Weights below a floor (suggested | |
| κ < 0.2) should be treated as exclusion, not as small-weight | |
| inclusion. Held-out validation fixes circular weighting; it does | |
| not fix systematic member bias. | |
| 4. **Per-dimension tie-break rule.** v1's `_discretize_mean` rule | |
| (ties to lower) was selected for conservative behavior on binary | |
| scales, where "conservative" means scoring 0 on uncertainty. On | |
| 3-point completeness, "conservative" means scoring toward | |
| *incomplete*, which interacts badly with member miscalibration. | |
| v1.2 should select the tie-break rule per-dimension based on the | |
| rubric's conservative direction, not globally. | |
| **Evidence files:** `docs/_generated/kappa_table.md` (regenerated with | |
| AC1 for groundedness/relevance, κ for completeness); | |
| `results/calibration_v1_judge_jury_kappa_weighted_members.jsonl` | |
| (per-member sidecar where the gpt-4o-mini completeness bias is | |
| visible per item); `results/calibration_v1_judge_baseline.json` | |
| (weights source; note the absence of any gpt-4o-mini-2024-07-18 | |
| entries, which is why the missing-weight fallback fires). | |
| ## v1.1 jury rescue: sharpened diagnostic + pre-committed A+B success criteria | |
| **Date:** 2026-05-06. **Status:** in-flight; this entry is the pre-experiment | |
| contract that pins down what counts as success before the re-aggregation | |
| runs, so the outcome can't be negotiated post-hoc. | |
| **Sharpened diagnostic: extraction-vs-reasoning split, not just "model is | |
| biased".** Re-reading the per-member sidecar (item-level, not aggregate) | |
| on the gpt-4o-mini completeness disputes shows a more specific failure | |
| mode than "harsh on 3-point". On the three representative gold=2 / Haiku=2 | |
| / gpt=1 cases (q006, k8s_002, k8s_018), gpt-4o-mini's `evidence_quotes` | |
| field correctly extracts the paraphrased coverage from the agent answer, | |
| and then its `reasoning` field denies that those very quotes constitute | |
| coverage. k8s_002 is the cleanest instance: the model quotes the strings | |
| "declarative updates" and "sticky identity" into evidence, then writes | |
| "the answer does not explicitly mention 'declarative updates' and 'sticky | |
| identity'". The score follows the reasoning, not the evidence. The | |
| mechanism is that the model's *post-extraction reasoning step* applies a | |
| literal-string-match standard to the answer text while the rubric | |
| requires "paraphrase allowed"; i.e., the structured-output discipline | |
| forced an extraction step that the reasoning step then contradicted on | |
| autopilot. This is a known failure mode in chain-of-thought judges and | |
| shows up more in smaller models because the reasoning step has less | |
| capacity to integrate the rubric's instruction with the literal-text | |
| comparison the model is running by default. The artifact for the writeup | |
| is `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` (three | |
| side-by-side reasoning + evidence_quotes excerpts). | |
| **Pragmatic v1.1 weights-source decision.** The v1.2 fix-list above | |
| specifies a held-out validation set for jury weights: methodologically | |
| clean, but it requires either splitting N=30 (loses statistical power on | |
| both halves) or labeling more items (eats interview prep time). v1.1 | |
| takes the pragmatic route: weights computed from the same calibration | |
| set used for κ reporting, with the circularity flagged in the writeup. | |
| Reasons: (a) the alternative is splitting N=30, (b) the per-member κ | |
| values used as weights are internally consistent, (c) v1.2 will use a | |
| held-out 20-item set. The writeup will contain a sentence acknowledging | |
| the circularity rather than hiding it. | |
| **v1.1 elevated fix-list (subset of the v1.2 list above).** Item 2 | |
| (symmetric coverage / hard-error) is elevated unconditionally. Item 1 | |
| (real κ-derived weights) is elevated in pragmatic form (same set with | |
| circularity caveat). Items 3 (per-dimension exclusion) and 4 | |
| (per-dimension tie-break) remain v1.2 unless B's outcome forces them up. | |
| **Pre-committed B success criteria.** Plan B is "re-aggregate the existing | |
| 164 member-rows in `calibration_v1_judge_jury_kappa_weighted_members.jsonl` | |
| with corrected κ-derived weights, no new API spend." The outcome maps | |
| deterministically to one of three predefined responses, picked *before* | |
| B runs: | |
| - **Outcome 1: jury κ on completeness exceeds Haiku-baseline κ by ≥ | |
| 0.05** (i.e., new jury κ ≥ 0.466, vs Haiku-alone 0.416). Writeup story: | |
| "v1's weights-source bug masked correct aggregation; once both bugs | |
| (asymmetric coverage + missing-weight fallback) are fixed, the jury | |
| improves on baseline. Per-dimension exclusion remains a v1.2 design | |
| pattern but is not needed at v1.1." This is the strong story. | |
| - **Outcome 2: jury κ within ±0.05 of Haiku-baseline** (i.e., 0.366 ≤ | |
| jury κ < 0.466). Writeup story: "weights-source fix recovers parity | |
| but the jury isn't doing meaningful work on completeness; | |
| gpt-4o-mini's near-zero weight makes it effectively excluded by aggregation. | |
| This is *soft exclusion via weighting*; v1.2 will make exclusion | |
| explicit." Defensible but less clean. | |
| - **Outcome 3: jury κ falls below Haiku-baseline κ by >0.05** (i.e., | |
| jury κ < 0.366). Writeup story: "weights-source fix is necessary but | |
| not sufficient; even at near-zero weight gpt-4o-mini's verdict tips | |
| disputed (1, 2) ties due to the round-down rule. v1.1 escalates to | |
| per-dimension exclusion." Item 3 of the v1.2 fix-list moves into v1.1. | |
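The three outcomes above amount to a small decision rule. A sketch of that rule follows; `classify_outcome` is a hypothetical name, and giving the shared 0.466 boundary to Outcome 1 is an assumption made so that the mapping stays deterministic.

```python
HAIKU_BASELINE_KAPPA = 0.416  # Haiku-alone completeness kappa from the text
MARGIN = 0.05                 # pre-committed band half-width


def classify_outcome(jury_kappa: float,
                     baseline: float = HAIKU_BASELINE_KAPPA,
                     margin: float = MARGIN) -> int:
    """Map a jury kappa to pre-committed outcome 1, 2, or 3."""
    delta = jury_kappa - baseline
    if delta >= margin:    # Outcome 1: jury beats baseline by at least the margin
        return 1
    if delta < -margin:    # Outcome 3: jury falls below baseline by more than the margin
        return 3
    return 2               # Outcome 2: parity band


for candidate in (0.500, 0.416, 0.300):
    print(candidate, "-> outcome", classify_outcome(candidate))
```

Writing the rule down before the run is what makes the later verdict non-negotiable: the number goes in, the response comes out.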
| **Why the predefined-criteria framing matters.** "I ran B, looked at the | |
| number, decided it was good enough" is the same data with a weaker frame | |
| than "I predefined the success criteria before running the experiment, B | |
| landed at outcome X, which mapped to predefined response Y". The latter | |
| demonstrates evaluation maturity in the writeup; the former invites | |
| post-hoc reading of the outcome. | |
| **B outcome (2026-05-06).** Plan B re-aggregated the existing 164 sidecar | |
| rows with κ-derived weights (Haiku=0.416, gpt-4o-mini=0.020 on | |
| completeness; clipped at 0 from raw κ values). Result: **jury κ on | |
| completeness = 0.416**, exactly matching Haiku-baseline. Δ = 0.000; | |
| maps to **Outcome 2 (soft exclusion via weighting)**. Per the | |
| pre-committed response, v1.1 stops here and writes up; per-dimension | |
| member exclusion (item C / v1.2 fix #3) is not escalated to v1.1. | |
| Mechanism, validated empirically: a disputed cell (Haiku=2, gpt=1) | |
| with corrected weights aggregates as `(2 × 0.416 + 1 × 0.020) / 0.436 = | |
| 1.954`. Since frac = 0.954 > 0.5, the round-up rule ceils to 2, giving | |
| the correct verdict. v1's two compounding bugs (asymmetric source | |
| returning weight=1.0 for Haiku and the missing-key fallback returning | |
| 1.0 for gpt-4o-mini) jointly forced equal weights, and equal weights | |
| with the same round-up rule produced `(2 × 1 + 1 × 1) / 2 = 1.5`, which | |
| has frac exactly 0.5 (not > 0.5) and so floored to 1, with gpt's verdict | |
| winning every disputed cell. The bug fixes recover the right verdict | |
| purely mechanically; no judge model behavior changes. | |
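The arithmetic above can be checked with a short sketch. `discretize_mean` mirrors the `frac > 0.5 → ceil else floor` rule quoted earlier; `jury_verdict` and the member keys are hypothetical names, not the contents of `jury.py`.

```python
import math


def discretize_mean(mean: float) -> int:
    """Round up only when the fractional part is strictly above 0.5."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


def jury_verdict(scores: dict, weights: dict) -> int:
    """Weighted mean of member scores, then the tie-to-lower discretization."""
    total = sum(weights[m] for m in scores)
    if total == 0:
        raise ValueError("all member weights are zero")
    mean = sum(scores[m] * weights[m] for m in scores) / total
    return discretize_mean(mean)


disputed = {"haiku": 2, "gpt-4o-mini": 1}

# v1: stub + missing-key fallback forced equal weights -> mean 1.5 -> floor.
equal_verdict = jury_verdict(disputed, {"haiku": 1.0, "gpt-4o-mini": 1.0})

# v1.1: kappa-derived weights -> mean ~1.954 -> ceil.
kappa_verdict = jury_verdict(disputed, {"haiku": 0.416, "gpt-4o-mini": 0.020})

print(equal_verdict, kappa_verdict)
```

The same two member scores yield opposite verdicts purely through the weights, which is why the fix needs no new API calls.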
| The empirical reading: the weighting is *not doing meaningful work*; | |
| gpt-4o-mini's near-zero weight effectively excludes it on completeness, | |
| and the jury's κ matches Haiku-alone exactly because Haiku's verdict | |
| wins every disputed cell. This is "soft exclusion via weighting"; v1.2's | |
| explicit per-dimension exclusion (item 3 of the v1.2 fix-list) makes the | |
| exclusion visible in the jury config rather than emergent from κ-derived | |
| weight collapse. | |
| **v1.1 code changes (this commit):** | |
| - `agent_bench/evaluation/variance/jury.py`: silent missing-weight | |
| fallback to 1.0 → hard `ValueError`. Two existing tests that asserted | |
| the old contract (`test_kappa_weighted_reasoning_reports_applied_weights_not_dict`, | |
| `test_kappa_weighted_logs_warning_on_missing_weight`) updated to | |
| assert the new contract. | |
| - `scripts/run_calibration.py::_load_weights_from_baseline` → | |
| `_compute_kappa_weights`: replaces the v1 stub with real per-judge | |
| Cohen's κ on the dimension; hard-errors when any expected member is | |
| missing from the source. Clips κ < 0 to weight = 0 (soft exclusion). | |
| - `configs/calibration/rows/jury_kappa_weighted.yaml`: `weights_source` | |
| re-pointed from `calibration_v1_judge_baseline.json` (Haiku-only, | |
| asymmetric coverage) to | |
| `calibration_v1_judge_jury_kappa_weighted_members.jsonl` (both judges, | |
| same calibration set with documented circularity). | |
| - `tests/scripts/test_run_calibration_dispatch.py`: two new tests cover | |
| `_compute_kappa_weights`: (a) computes real κ (high-agreement judge → | |
| weight=1.0, chance-agreement judge → 0); (b) hard-errors on | |
| asymmetric source coverage. | |
| - `results/calibration_v1_judge_jury_kappa_weighted_v1_1.json`: new | |
| predictions row produced by re-aggregating the existing sidecar | |
| offline (no API spend; via `scripts/_dev/reaggregate_jury_v1_1.py`). | |
| `docs/_generated/kappa_table.md` regenerated with this row alongside | |
| the broken v1 row, giving the writeup a clean before/after diff | |
| (completeness: 0.014 → 0.416, n=26). | |
| - `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`: the | |
| three side-by-side reasoning + evidence_quotes excerpts (q006 / | |
| k8s_002 / k8s_018) demonstrating the extraction-vs-reasoning split | |
| diagnostic finding. | |
| The v1.2 fix-list above is unchanged in scope; v1.1 elevates items 1 | |
| (pragmatic form) and 2 (full form). Items 3 and 4 remain v1.2. | |
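A sketch of the weight computation described in the code-change list: real per-judge Cohen's κ, negative κ clipped to weight 0, and a hard error on asymmetric coverage. The function names, signatures, and toy labels are assumptions for illustration and are not copied from `scripts/run_calibration.py`.

```python
from collections import Counter


def cohen_kappa(gold, pred):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(gold)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[k] * pred_counts[k] for k in gold_counts) / (n * n)
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def compute_kappa_weights(gold, member_preds, expected_members):
    """Return {member: weight}; abort if any expected member has no rows."""
    missing = sorted(set(expected_members) - set(member_preds))
    if missing:
        raise ValueError(f"weights source has no rows for: {missing}")
    # Clip negative kappa to 0 so a worse-than-chance member is soft-excluded.
    return {m: max(0.0, cohen_kappa(gold, member_preds[m]))
            for m in expected_members}


gold = [1, 1, 2, 2]
member_preds = {
    "judge_a": [1, 1, 2, 2],  # perfect agreement
    "judge_b": [1, 2, 1, 2],  # chance-level agreement
    "judge_c": [2, 2, 1, 1],  # systematic disagreement, negative kappa
}
weights = compute_kappa_weights(gold, member_preds,
                                ["judge_a", "judge_b", "judge_c"])
print(weights)
```

Raising on a missing member, rather than defaulting its weight to 1.0, is the part that closes the asymmetric-coverage hole.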
| ## Plan 3A: recency-positioned paraphrase instruction (pre-committed criteria) | |
| **Date:** 2026-05-06. **Status:** in-flight; this entry pins down the | |
| hypothesis and success criteria before the experiment runs. | |
| **Hypothesis sharpened by the 1A direction-of-bias finding.** | |
| GPT-4o-mini's completeness disagreements are 17/19 gold=2/pred=1 with zero | |
| up-mistakes across 26 items spanning two corpora: direction-aware noise, | |
| not balanced random labeling. The model is consistently applying *some* | |
| rule stricter than the rubric requires. The hypothesis under test: that | |
| stricter rule is "literal-string match required, paraphrase doesn't | |
| count," and the bias is fixable by recency-positioning the rubric's | |
| "paraphrase allowed" instruction adjacent to the commit-to-score | |
| decision instead of leaving it 500+ tokens upstream in the rubric body. | |
| **The intervention is positional, not lexical.** The current | |
| `CompletenessJudge` prompt (`agent_bench/evaluation/judges/completeness.py`) | |
| sends the rubric body, then the gold reference, then the system answer, | |
| then a one-line "Score this answer..." instruction immediately followed | |
| by the JSON schema clause. The rubric body's "paraphrase allowed" clause | |
| appears in the introductory paragraphs, hundreds of tokens before the | |
| score decision. The intervention adds one sentence between the system | |
| answer and the score instruction: | |
| > *"Note: a paraphrase that captures the same meaning as a gold-answer | |
| > point counts as covered. Score on content equivalence, not surface | |
| > form."* | |
| This is the recency-positioning hypothesis: the model loses the | |
| paraphrase conditioning across the rubric anchors and the reasoning | |
| step. Restating the instruction adjacent to the score decision tests | |
| whether the bias is positionally correctable. | |
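A sketch of the positional intervention, assuming the prompt is a plain concatenation of sections. The clause text is quoted from above; `build_completeness_prompt`, the section labels, and the placeholder inputs are hypothetical, and the score instruction is left elided exactly as the text gives it.

```python
PARAPHRASE_RECENCY_CLAUSE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)
SCORE_INSTRUCTION = "Score this answer..."  # elided in the source; kept as-is


def build_completeness_prompt(rubric: str, gold_reference: str,
                              system_answer: str) -> str:
    """Assemble sections in order; the clause sits right before the score step."""
    sections = [
        rubric,
        f"Gold reference:\n{gold_reference}",   # label is an assumption
        f"System answer:\n{system_answer}",     # label is an assumption
        PARAPHRASE_RECENCY_CLAUSE,
        SCORE_INSTRUCTION,
    ]
    return "\n\n".join(sections)


prompt = build_completeness_prompt("RUBRIC BODY", "gold points", "agent answer")

# Position is the thing under test, so check ordering, not just inclusion.
answer_pos = prompt.index("agent answer")
clause_pos = prompt.index(PARAPHRASE_RECENCY_CLAUSE)
score_pos = prompt.index(SCORE_INSTRUCTION)
print(answer_pos < clause_pos < score_pos)
```

An ordering check of this kind guards against a later refactor that keeps the clause in the prompt but moves it back upstream.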
| **Selected 5 disputed items** (representative of the gold=2 / Haiku=2 / | |
| gpt=1 pattern across both corpora): `q006`, `q011`, `k8s_002`, `k8s_006`, | |
| `k8s_018`. All five are pure paraphrase-coverage cases (the system | |
| answer paraphrases the gold's points; Haiku scored 2; GPT-4o-mini scored | |
| 1 with the extraction-vs-reasoning split documented in | |
| `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`). | |
| **Pre-committed 3A success criteria.** | |
| - **Fixed (≥3/5 shift from 1 → 2):** Recency-positioning is sufficient. | |
| Re-run GPT-4o-mini on all 26 scored completeness items with the | |
| corrected prompt, recompute κ, update the writeup table. Story: | |
| "rubric-engineering matters more than judge model choice for ordinal | |
| scales; recency-positioning the paraphrase instruction recovered N% of | |
| disputed items." The completeness story becomes actionable, not | |
| diagnostic-only. | |
| - **Partially fixed (1–2/5 shift):** Inconclusive at N=5 (the | |
| binomial-significance line is ~3+). Re-run on all 26 scored items | |
| (~$0.20) to get a clean number; write up whatever the full-26 says. | |
| - **Not fixed (0/5 shift):** The instruction is being received and | |
| ignored; the model can't act on it under reasoning load. Escalate | |
| to 4A (GPT-4o full on the same 5 items) to verify the | |
| small-model-specific claim. Story: "repositioning the paraphrase | |
| instruction adjacent to the score decision did not shift any of 5 | |
| disputed items; GPT-4o handled the same prompts. The bias is | |
| small-model-specific, not prompt-fixable." | |
| The 3/5 threshold is the binomial-significance line at this N: random | |
| shifting under the null produces 0 or 1 changes most of the time. | |
| Pre-committing avoids the "2 shifted, that's kind of a fix" negotiation. | |
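The threshold claim can be made concrete with a binomial tail. The text does not state a null per-item flip rate, so the 0.2 used here is purely a hypothetical value for illustration; a higher assumed rate would weaken the 3/5 line.

```python
from math import comb


def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_ITEMS = 5
NULL_FLIP_RATE = 0.2  # assumption: chance a score flips 1 -> 2 with no real effect

p_at_least_3 = binom_tail(N_ITEMS, 3, NULL_FLIP_RATE)
p_at_most_1 = 1 - binom_tail(N_ITEMS, 2, NULL_FLIP_RATE)

print(f"P(>=3 shifts | null) = {p_at_least_3:.4f}")
print(f"P(<=1 shift  | null) = {p_at_most_1:.4f}")
```

Under that assumed rate, three or more shifts happen by chance a little under 6% of the time, and zero or one shift is the typical null result.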
| **On the 1A relevance finding: confirmed.** Both judges essentially | |
| correct on every relevance item (Haiku 29/30, GPT-4o-mini 30/30); κ | |
| degeneracy is structural under 29/30 prevalence at class-2; AC1 + | |
| raw agreement is the right reporting. No further investigation on | |
| relevance. Writeup paragraph is one short sentence: prevalence-induced | |
| degeneracy → AC1 is load-bearing. | |
| ## Plan 3A: outcome on the 5-item probe + full-26 re-run (v1.1.1) | |
| **Date:** 2026-05-06. **Status:** complete; the v1.1.1 prompt is now | |
| permanent in `agent_bench/evaluation/judges/completeness.py`. | |
| **3A 5-item probe:** 3/5 disputed items shifted 1 → 2 (q006, q011, | |
| k8s_002), 2/5 unchanged (k8s_006, k8s_018). Cost $0.0013. This meets the | |
| pre-committed threshold (≥3/5 → "fixed"), so the protocol triggered the | |
| full-26 re-run on gpt-4o-mini only (Haiku held as control to make the | |
| v1.1 → v1.1.1 delta cleanly attributable to the intervention's effect on | |
| the affected judge). | |
| **Full-26 re-run (gpt-4o-mini completeness, v1.1.1 prompt):** | |
| | | n | raw | κ | AC1 | |
| |------------------------------|----|--------|--------|--------| | |
| | v1.1 gpt-4o-mini | 26 | 26.9% | +0.020 | +0.006 | | |
| | **v1.1.1 gpt-4o-mini** | 28 | **42.9%** | **+0.000** | **+0.232** | | |
| | v1.1 Haiku (control) | 26 | 84.6% | +0.416 | +0.792 | | |
| **Per-item delta (v1.1 → v1.1.1):** 7 items shifted up (all 1 → 2), | |
| 0 shifted down, 19 unchanged. Of the 7 up-shifts: 6 are correct (gold=2 | |
| items moving from pred=1 to pred=2: k8s_002, k8s_013, k8s_015, k8s_016, | |
| k8s_017, q006), 1 is a regression (k8s_025: gold=1, was correctly pred=1 | |
| in v1.1, now over-credited at pred=2). Net per-item correctness delta: | |
| +5 items. | |
| **Cohen's κ is misleading on this comparison.** v1.1.1 raw agreement | |
| rose from 26.9% to 42.9% (+16 percentage points), and AC1 rose from | |
| 0.006 to 0.232 (38× improvement). But Cohen's κ stayed at ~0, slightly | |
| *lower* than v1.1's 0.020. The mechanism is prevalence-rebalancing in | |
| the marginals: gpt-4o-mini's pred distribution shifted from `{0:2, 1:19, | |
| 2:5}` (concentrated at 1) to `{0:4, 1:12, 2:12}` (more balanced, closer | |
| to gold's `{1:5, 2:23}` over n=28). Cohen's κ = `(P_o - P_e)/(1 - P_e)`; | |
| when marginals become more diverse, P_e (chance agreement) rises in | |
| lockstep with P_o (observed agreement), and κ deflates. AC1 uses | |
| prevalence-robust chance correction (`P_e = (1/(q-1)) Σ_k pi_k(1-pi_k)`) | |
| and reads the actual signal. | |
| This is the same trap that motivated AC1 over κ on the relevance and | |
| groundedness rows of the original κ table, surfacing here at a | |
| different distribution boundary. The κ table footer already explains | |
| why per-dimension metric selection matters; v1.1.1's outcome | |
| demonstrates the trap *induced by the intervention itself*. | |
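The two chance corrections can be recomputed from the marginals quoted above. This sketch takes the v1.1.1 counts from the text and assumes the reported 42.9% raw agreement over n=28 corresponds to 12 of 28 items; the function names are hypothetical.

```python
def chance_terms(gold_counts, pred_counts, categories=(0, 1, 2)):
    """Return (P_e for Cohen's kappa, P_e for Gwet's AC1) from marginal counts."""
    n = sum(gold_counts.values())
    if n != sum(pred_counts.values()):
        raise ValueError("gold and pred must cover the same number of items")
    g = {k: gold_counts.get(k, 0) / n for k in categories}
    p = {k: pred_counts.get(k, 0) / n for k in categories}
    pe_kappa = sum(g[k] * p[k] for k in categories)        # product of marginals
    pi = [(g[k] + p[k]) / 2 for k in categories]           # averaged marginals
    pe_ac1 = sum(v * (1 - v) for v in pi) / (len(categories) - 1)
    return pe_kappa, pe_ac1


def chance_corrected(p_o, p_e):
    return (p_o - p_e) / (1 - p_e)


gold_counts = {1: 5, 2: 23}          # gold over n=28, from the text
pred_counts = {0: 4, 1: 12, 2: 12}   # v1.1.1 gpt-4o-mini predictions, from the text
p_o = 12 / 28                        # assumption: 42.9% of 28 items

pe_kappa, pe_ac1 = chance_terms(gold_counts, pred_counts)
kappa = chance_corrected(p_o, pe_kappa)
ac1 = chance_corrected(p_o, pe_ac1)
print(f"P_e(kappa)={pe_kappa:.3f} kappa={kappa:.3f}  P_e(AC1)={pe_ac1:.3f} AC1={ac1:.3f}")
```

Under these inputs Cohen's chance term lands on the observed agreement itself, so κ reads zero, while AC1's smaller chance term leaves a positive score consistent with the table.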
| **Effect on the jury aggregate.** With κ-derived weights and | |
| gpt-4o-mini's v1.1.1 κ at 0 (clipped from +0.000 to weight=0), the jury | |
| verdict on completeness is now mathematically equivalent to Haiku-alone | |
| on every item (gpt's contribution is multiplied by zero). Jury κ stays | |
| at 0.416, identical to v1.1's corrected aggregate. The intervention's | |
| per-member improvement is *invisible at the jury level* under this | |
| weighting scheme. | |
| **Methodological consequence: v1.2 fix-list addition.** The v1.2 | |
| fix-list now expands by one item: | |
| 5. **Prevalence-robust weights for prevalence-skewed dimensions.** | |
| v1.1's `_compute_kappa_weights` uses Cohen's κ for every dimension, | |
| which has a *self-defeating property* on prevalence-skewed gold: | |
| improving a member can lower its weight even as it gets more | |
| accurate. | |
| **Mechanism.** Cohen's κ = `(P_o - P_e) / (1 - P_e)`, where | |
| `P_e = Σ_k P(gold=k) × P(pred=k)` is the chance-agreement term | |
| computed from the marginal distributions. P_e is *not* invariant to | |
| the predictor's marginal distribution: when a member's predictions | |
| become more diverse (less concentrated at one class), P_e *rises* | |
| as the marginals approach gold's marginals. Concretely: when an | |
| intervention moves a member's pred distribution from | |
| concentrated-at-one-class toward gold's distribution, P_o and P_e rise | |
| together in lockstep. The numerator `P_o - P_e` stays small, and κ | |
| deflates even as raw accuracy improves. This is the same | |
| prevalence-induced degeneracy that motivated AC1 over κ on | |
| relevance/groundedness rows in the κ table; it surfaces in jury | |
| weighting at any distribution-shifting intervention's boundary. | |
| **Empirically observed in v1.1.1.** The recency-positioning | |
| intervention shifted gpt-4o-mini completeness pred dist from | |
| `{0:2, 1:19, 2:5}` to `{0:4, 1:12, 2:12}`, closer to gold's | |
| `{1:5, 2:23}` over n=28. Per-cell raw agreement 26.9% → 42.9%. | |
| AC1 (Gwet 2008) reads the change correctly: 0.006 → 0.232 (38×). | |
| Cohen's κ stays at ~0 (0.020 → 0.000) because P_e is now ≈ P_o | |
| ≈ 0.43. v1.1's `_compute_kappa_weights` clips the new κ at zero, | |
| producing weight = 0, and the jury aggregate loses access to a | |
| member that was empirically improved. The intervention's | |
| per-member improvement is invisible at the jury level under κ-weighting. | |
| **Architectural decomposition for v1.2.** The right separation: | |
| - **Per-dimension metric for κ table reporting** (already in v1.1 | |
| via `agent_bench/evaluation/calibration/report.py::_DIM_METRIC`). | |
| - **Per-dimension weight metric for jury aggregation** (new in | |
| v1.2, reuses `_DIM_METRIC`). Use κ where the gold's prevalence | |
| supports it, AC1 where κ degenerates. Same lookup, same | |
| per-dimension policy at both reporting and weighting layers. | |
| - **Per-dimension membership as explicit configuration override** | |
| for members that are structurally inappropriate (v1.2 fix #3, | |
| unchanged), distinct from "low score on the chosen metric," | |
| which is handled by the weight floor. | |
| **Why this is non-obvious.** A reader's first instinct is that | |
| "weight by κ" is a sensible default, since κ is *the* standard | |
| inter-rater statistic. The self-defeating property is invisible until | |
| you observe a real intervention that shifts marginals; in static | |
| conditions (no intervention, fixed prompts), the κ-weight choice | |
| is benign. The v1.1.1 outcome is the first time the agent-bench | |
| calibration set has produced an intervention-induced marginal | |
| shift on the same gold; the failure mode wouldn't have been | |
| visible in v1.0's static calibration sweep. | |
| **v1.1.1 code changes (this commit):** | |
| - `agent_bench/evaluation/judges/completeness.py`: adds | |
| `PARAPHRASE_RECENCY_CLAUSE` constant, inserted between the system | |
| answer and the score instruction. Comment cites the 3A probe. | |
| - `tests/evaluation/test_judges.py::TestCompletenessJudge::test_reference_answer_in_prompt`: | |
| extends to assert the recency clause appears AND is positioned | |
| between the answer and the score instruction (position is | |
| load-bearing, not just lexical inclusion). | |
| - `results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl`: | |
| merged sidecar combining v1.1 groundedness/relevance rows (unchanged | |
| judges), fresh v1.1.1 gpt-4o-mini completeness rows, and v1.1 Haiku | |
| completeness rows. | |
| - `measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl`: the | |
| 5-item probe artifact with reasoning + evidence_quotes for each. | |
| - `scripts/_dev/probe_3a_paraphrase_recency.py`, | |
| `scripts/_dev/rerun_completeness_v1_1_1.py`: reproducers; not | |
| part of the production calibration runner. | |
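The positional assertion described for `test_reference_answer_in_prompt`
can be sketched as follows. This is not the repository's test: the
clause wording, `SCORE_INSTRUCTION`, and `build_prompt` are placeholders
assumed for illustration, since the real prompt text is not quoted here.

```python
# Placeholder wording; the real clause lives in completeness.py.
PARAPHRASE_RECENCY_CLAUSE = (
    "Reminder: a paraphrase of a reference point counts as covering it."
)
# Hypothetical score instruction for a 0/1/2 rubric.
SCORE_INSTRUCTION = "Return a completeness score of 0, 1, or 2."


def build_prompt(reference: str, answer: str) -> str:
    """Assemble the judge prompt with the clause placed after the answer."""
    parts = [
        f"Reference answer:\n{reference}",
        f"System answer:\n{answer}",
        PARAPHRASE_RECENCY_CLAUSE,
        SCORE_INSTRUCTION,
    ]
    return "\n\n".join(parts)


def test_recency_clause_position() -> None:
    answer = "autoscaling/v2 is stable and supports memory metrics."
    prompt = build_prompt("reference text", answer)

    # Lexical inclusion alone is not enough; check the ordering too.
    answer_end = prompt.index(answer) + len(answer)
    clause_at = prompt.index(PARAPHRASE_RECENCY_CLAUSE)
    score_at = prompt.index(SCORE_INSTRUCTION)
    assert answer_end <= clause_at < score_at


test_recency_clause_position()
```

Using `str.index` rather than `in` means the test fails loudly both
when the clause is missing and when a later refactor moves it ahead of
the answer.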
| **No changes to the κ table.** The jury aggregate κ on completeness is | |
| unchanged (0.416 → 0.416) because of the κ-as-weight degeneracy | |
| described above; adding a `jury_kappa_weighted_v1_1_1` row with | |
| identical numbers would be visual noise. The v1.1.1 finding lives in | |
| the writeup body, not the table: the per-member AC1 improvement | |
| (0.006 → 0.232) is the headline number, surfaced as a separate | |
| paragraph next to the κ table rather than inside it. | |
| **Total spend through Plan 3A:** $0.0013 (3A probe) + $0.0075 (full-26 | |
| re-run) = $0.0088. | |
| ## Plan 4A: GPT-4o (full) on the v1.1.1 residual | |
| **Date:** 2026-05-06. **Status:** complete. Run after the | |
| writeup-framing review surfaced that v1.1.1's "fixed" verdict was | |
| overclaim-prone: 5/19 items were recovered, 14 remained unchanged and | |
| uncharacterized. 4A was originally scoped as conditional on 3A *not* | |
| being fixed (per the predefined sequencing rule), but became valuable | |
| as a *post-3A* diagnostic to characterize the residual: is it | |
| small-model-specific or rubric-under-specified? | |
| **Scope.** GPT-4o (`gpt-4o-2024-08-06`) on 5 of the 14 v1.1.1-unchanged | |
| items: `k8s_006`, `k8s_018`, `q011`, `q012`, `k8s_001`. Same v1.1.1 | |
| production prompt (paraphrase recency clause active). The first two | |
| (k8s_006, k8s_018) are the items that didn't shift in the original 3A | |
| 5-item probe; we have gpt-4o-mini's reasoning on those items *with* | |
| the v1.1.1 intervention, so 4A gives a clean A/B at fixed prompt | |
| varying only the model. q011, q012, k8s_001 cover the broader | |
| fastapi/k8s residual surface (k8s_001 is also a Haiku miscall, so 4A | |
| checks whether GPT-4o agrees with gold or with Haiku). | |
| **Result: 5/5 correct.** All 5 items scored 2 by GPT-4o, matching gold | |
| exactly. Cost: $0.0011 reported (caveat: pricing config falls back to | |
| gpt-4o-mini rates for unlisted models, so actual cost is closer to | |
| $0.005–0.01; the reported number under-reports by ~5–10×). | |
| **Sharpened mechanism: criteria-invention, not just literal-match.** | |
| The original 3-example artifact (q006, k8s_002, k8s_018) was framed | |
| as gpt-4o-mini "applying a literal-string-match standard" while | |
| correctly extracting paraphrased coverage into evidence_quotes. 4A's | |
| side-by-side reasoning on `k8s_018` shows a distinct second mechanism: | |
| - **gpt-4o-mini (v1.1.1, score 1):** "It mentions some key points | |
| from the reference... but does not explicitly state that the new | |
| fields in `autoscaling/v2` are preserved as annotations when using | |
| `autoscaling/v1`, nor does it mention the need to use | |
| `autoscaling/v2` directly for memory or custom metric scaling for | |
| a Deployment or StatefulSet." | |
| - **gpt-4o (4A, score 2):** "The answer covers all the key points | |
| from the reference. It mentions that the current stable version is | |
| autoscaling/v2, which supports scaling on memory and custom | |
| metrics, similar to the reference. It also notes that | |
| autoscaling/v1 only supports CPU-based scaling, aligning with the | |
| reference's points." | |
| The reference for k8s_018 specifies three points: (1) autoscaling/v2 | |
| is the current stable API, (2) it adds memory metrics support beyond | |
| v1's CPU-only, (3) it adds custom metrics support. gpt-4o-mini's | |
| reasoning step *invents additional criteria* the reference does not | |
| require ("preserved as annotations when using autoscaling/v1," "use | |
| autoscaling/v2 directly for ... a Deployment or StatefulSet") and then | |
| deducts against them, scoring 1. GPT-4o reads the reference's three | |
| points and scores against exactly those, scoring 2. | |
| This is a *capacity* finding distinct from the paraphrase-recency | |
| finding: gpt-4o-mini's reasoning, even with the v1.1.1 prompt directing | |
| it toward paraphrase semantics, manufactures additional gold criteria | |
| during scoring that aren't in the reference. Recency-positioning the | |
| "paraphrase allowed" clause doesn't address this: the bias isn't | |
| "missed paraphrase," it's "invented extra requirements." Two failure | |
| modes were stacked; v1.1.1 fixed one; the second is what 4A surfaces. | |
| **Implication for v1.2.** With 5/5 confirmed, v1.2 fix #3 (per- | |
| dimension membership) gets clean empirical support: gpt-4o-mini is | |
| the wrong tool for 3-point completeness with paraphrase semantics, and | |
| no amount of prompt engineering on this rubric is going to bridge the | |
| capacity gap. The right v1.2 path is one of: | |
| - **Exclude gpt-4o-mini from completeness scoring** (per-dim | |
| membership; jury reduces to single-judge Haiku on completeness; | |
| explicit and visible in config). | |
| - **Replace gpt-4o-mini with GPT-4o on completeness** (per-dim | |
| judge selection; jury keeps two members but the second is a | |
| frontier-class model on the dimension that needs it). | |
| Both are defensible v1.2 designs. The choice depends on cost | |
| budget: gpt-4o is ~10× the per-call cost of gpt-4o-mini. For | |
| agent-bench's calibration set scale (~30 items × per-row), even | |
| gpt-4o is trivially cheap; for production deployment evaluating | |
| thousands of agent outputs, the cost trade-off matters more. | |
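The two options differ only in which judges are listed for the
completeness dimension. A rough sketch, assuming a per-dimension roster
override; the member labels and the `judges_for` helper are
illustrative names, not the project's actual config schema.

```python
from typing import Dict, List

# Default two-member jury (labels are shorthand, not exact model IDs).
BASE_JURY: List[str] = ["gpt-4o-mini", "haiku"]

# Option A: exclude gpt-4o-mini, leaving single-judge Haiku.
OPTION_EXCLUDE: Dict[str, List[str]] = {"completeness": ["haiku"]}

# Option B: swap in GPT-4o so the jury keeps two members.
OPTION_REPLACE: Dict[str, List[str]] = {"completeness": ["gpt-4o", "haiku"]}


def judges_for(dimension: str, overrides: Dict[str, List[str]]) -> List[str]:
    """Return the jury roster for a dimension, honoring any override."""
    return list(overrides.get(dimension, BASE_JURY))


print(judges_for("completeness", OPTION_EXCLUDE))
print(judges_for("completeness", OPTION_REPLACE))
print(judges_for("relevance", OPTION_REPLACE))
```

Keeping the override keyed by dimension leaves groundedness and
relevance on the default roster, so either option stays explicit and
visible in configuration.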
| **4A artifact:** `measurements/2026-05-06-4a-gpt4o-full-probe.jsonl` | |
| (per-item reasoning + evidence_quotes for the 5 GPT-4o calls; pairs | |
| with the v1.1 sidecar's gpt-4o-mini reasoning on the same items for | |
| the side-by-side analysis above). | |
| **Updated honest framing for the writeup.** "v1.1.1 addressed one | |
| identified failure mode (paraphrase-instruction-loss across reasoning, | |
| recovered 5/19 disputed items via positional change). 4A confirmed the | |
| residual 14 are a distinct failure mode (capacity-limited criteria | |
| invention during the reasoning step); GPT-4o handles all 5 sampled | |
| residuals at the same v1.1.1 prompt, so the failure is | |
| small-model-specific rather than rubric-limited. v1.2 fix #3 | |
| (per-dimension judge membership / model selection) is the right | |
| escalation; the rubric itself doesn't need changes." | |
| **Total session spend:** $0.0099 reported (~$0.013–0.018 actual after | |
| gpt-4o pricing correction). | |