docs: decisions for multi-corpus refactor

Seven new entries appended to DECISIONS.md:

1. Per-corpus refusal thresholds: why a single global threshold is
   wrong when FastAPI and K8s have different retrieval score
   distributions.
2. Corpus × provider composition: why the nested
   corpus_map[corpus][provider] structure exists and why flat would
   silently break the provider toggle in multi-corpus mode. This is
   the deviation from the original plan and the one architectural
   call that actually matters for the demo's credibility.
3. Single parameterized system prompt: why one template beats two
   per-corpus templates, and why the wording deliberately differs
   from the typical "don't hallucinate" RAG prompt ("refuse
   explicitly", "do not infer / extrapolate / general knowledge").
4. K8s curation targets recruiter-likely questions, not coverage:
   policy for what's in data/k8s_docs/SOURCES.md and why
   etcd-internals questions should be correctly refused.
5. No cross-corpus score comparison (BEIR principle): concrete
   consequences for this repo, including keeping the hero-tile
   citation accuracy FastAPI-specific and avoiding combined-mode
   aggregation in evaluate-fast.
6. K8s golden dataset uses the CRAG taxonomy: distribution across
   simple fact / multi-hop / comparison / conditional / false-premise /
   version-specific, with the schema v2 fields that support multi-hop
   partial credit.
7. Cold-start contingency: measure first, lazy-load only if HF
   Spaces cold-start exceeds 60 s. The lazy-load path is deliberately
   not pre-built because the test surface for
   lazy-loading + corpus routing + provider switching is non-trivial.

No code changes; DECISIONS.md only. Full suite still 421 passing,
ruff clean, mypy clean.

- DECISIONS.md +169 -0

@@ -353,3 +353,172 @@ reactive framework adds a dependency, interview questions about
"why is there a framework for 5 state variables", and indirection
that fights the imperative SSE pattern. One `state` object + a few
`render()` functions handles it in ~150 lines.

## Why per-corpus refusal thresholds?

FastAPI and Kubernetes have different corpus characteristics. FastAPI
has 16 short, well-structured docs with sparse cross-references, so
relevance tends to concentrate in 1-2 chunks per query. Kubernetes
has 30-40 docs with heavy cross-referencing between concepts (Pod →
Deployment → Service → Ingress), which spreads relevance across more
chunks. A single global refusal threshold would either refuse too
aggressively on K8s (no single chunk dominates, so the top score
looks "low") or not aggressively enough on FastAPI (where a
moderate-scoring chunk might be the only hit and the query should
still be refused).

`CorpusConfig` carries `refusal_threshold` as a per-corpus field.
Each threshold gets tuned against its own golden dataset. There
is no "fair" shared threshold because BEIR showed that retrieval
scores are not comparable across corpora. Placeholder values ship in
`default.yaml` and are replaced by tuned values during the per-corpus
evaluation sweep.
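
As a rough illustration of the per-corpus gate described above, here is a minimal sketch. Only `CorpusConfig` and `refusal_threshold` come from this entry; the `label` field, the `CORPORA` mapping, the `should_refuse` helper, the placeholder numbers, and the rule "refuse when the top score is below the threshold" are assumptions made for the example, not the repo's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusConfig:
    label: str                # hypothetical field, used only for display
    refusal_threshold: float  # tuned per corpus, never shared


# Placeholder values for illustration only; real values would come from
# the per-corpus evaluation sweep.
CORPORA = {
    "fastapi": CorpusConfig(label="FastAPI", refusal_threshold=0.35),
    "k8s": CorpusConfig(label="Kubernetes", refusal_threshold=0.20),
}


def should_refuse(corpus: str, scores: list[float]) -> bool:
    """Refuse when nothing was retrieved or the best score is below
    the threshold configured for this specific corpus."""
    cfg = CORPORA[corpus]
    return not scores or max(scores) < cfg.refusal_threshold
```

With these placeholder numbers, the same top score of 0.25 is refused on the FastAPI corpus but answered on the K8s corpus, which is the behavior a single global threshold cannot express.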

## Why corpus and provider toggles compose: corpus_map[corpus][provider]

The simpler design would have been `corpus_map[corpus]` returning a
single orchestrator. It ships in 10 fewer lines. It also silently
breaks the provider toggle in multi-corpus mode: the orchestrator
inside each corpus cell holds one fixed provider, and clicking
"Anthropic" in the dashboard keeps running on OpenAI.

This project's hero-tile metric is the provider comparison (`1.00 API /
0.14 7B self-hosted`). Breaking the mechanism that demonstrates that
metric, on a portfolio demo where a reviewer will open DevTools and
notice, would erode the honest-evaluation brand the whole repo is
built around. The nested `corpus_map[corpus][provider]` structure
keeps both toggles functional. Store, retriever, and search tool are
shared across providers within a corpus (the expensive objects are
held once per corpus); only the orchestrator varies per provider
since it holds the LLM client. Per-corpus × per-provider memory
overhead is an orchestrator struct, not a FAISS index.

RSS is logged per corpus, not per corpus × provider, because the
store is what drives memory. The provider multiplier is negligible
compared to a hybrid index + embedder.
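
The nesting can be sketched as follows. This is an illustrative toy under stated assumptions: `Store`, `Orchestrator`, and `build_corpus_map` are hypothetical stand-ins (the real store, retriever, search tool, and LLM client are not shown in this entry); only the `corpus_map[corpus][provider]` shape is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """Stand-in for the expensive per-corpus objects (index, retriever)."""
    corpus: str


@dataclass
class Orchestrator:
    """Stand-in for the cheap per-provider wrapper holding the LLM client."""
    store: Store
    provider: str


def build_corpus_map(
    corpora: list[str], providers: list[str]
) -> dict[str, dict[str, Orchestrator]]:
    corpus_map: dict[str, dict[str, Orchestrator]] = {}
    for corpus in corpora:
        store = Store(corpus)  # built once per corpus, shared below
        corpus_map[corpus] = {
            provider: Orchestrator(store, provider) for provider in providers
        }
    return corpus_map


corpus_map = build_corpus_map(["fastapi", "k8s"], ["openai", "anthropic"])
```

Looking up `corpus_map["k8s"]["anthropic"]` returns an orchestrator bound to the requested provider, while both K8s orchestrators point at the same `Store` instance, so adding a provider does not duplicate the index.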

## Why one parameterized system prompt, not per-corpus templates

The template is `"You are a technical documentation assistant for
{corpus_label}..."`. The only corpus-specific element is the label;
prompt content is identical across corpora: same citation format,
same refusal language, same grounding instructions. Having two
separate prompt files would invite drift: someone tweaks the FastAPI
prompt for a specific failure mode and forgets to update the K8s
version, and the demo silently answers differently on the two toggles.

The parameterization is enforced by two tests: (a)
`format_system_prompt("")` raises `ValueError` so an empty or
unresolved `{corpus_label}` can never reach the LLM, and (b) a spy on
`orchestrator.run_stream` asserts FastAPI and K8s requests receive
different prompts with the correct label substituted.
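
A minimal sketch of what test (a) guards, assuming the function takes the label as its only argument. The template string is copied from this entry with its elided tail left as `...`; the constant name and the whitespace check are assumptions for illustration.

```python
# The trailing "..." stands for the rest of the prompt, which is elided
# in this entry and deliberately not reconstructed here.
SYSTEM_PROMPT_TEMPLATE = (
    "You are a technical documentation assistant for {corpus_label}..."
)


def format_system_prompt(corpus_label: str) -> str:
    """Substitute the corpus label, rejecting empty or blank labels."""
    if not corpus_label or not corpus_label.strip():
        raise ValueError("corpus_label must be a non-empty string")
    return SYSTEM_PROMPT_TEMPLATE.format(corpus_label=corpus_label.strip())
```

Calling `format_system_prompt("FastAPI")` and `format_system_prompt("Kubernetes")` yields prompts that differ only in the label, and `format_system_prompt("")` raises `ValueError` before any request is built.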

The wording deliberately differs from the typical "don't hallucinate"
RAG template:

- **"refuse the question explicitly"** matches our refusal-gate
  mechanism. "Say so politely" is soft language that models interpret
  as "hedge and answer anyway".
- **"do not infer, do not extrapolate, do not draw on general
  knowledge"** is the three-verb prohibition. "Do not fabricate" is
  empirically easier to slip past because models distinguish
  fabrication (making things up) from extrapolation (drawing
  conclusions from adjacent but non-authoritative context).

## Why Kubernetes curation targets recruiter-likely questions, not coverage

The K8s corpus targets ~30-40 pages curated around concepts a
technical reviewer would naturally ask about (Pod, Deployment, Service,
Ingress, ConfigMap, RBAC) plus cross-referencing overview pages that
stress the reranker. Cluster administration deep-dives, tutorials,
and kubectl reference are explicitly excluded: they add noise without
adding reviewer value and hurt retrieval precision when adjacent
content is thin on concept definitions.

`data/k8s_docs/SOURCES.md` is a version-controlled curation artifact.
Each ingested URL has a one-line rationale, a date pulled, and a
license note. This makes the corpus reproducible and documents the
curation reasoning for any reviewer who looks closely.

Trade-off: the corpus is not comprehensive K8s knowledge. A question
about etcd Raft internals will be correctly refused. This is not a
bug: the refusal is part of the demo story, and "the system knows
what it doesn't know" is a feature of the grounded-refusal mechanism.

## Why no cross-corpus score comparison (BEIR principle)

Per BEIR (Thakur et al., NeurIPS 2021), absolute retrieval scores are
not comparable across different corpora: score distributions depend
on chunk length, vocabulary overlap, and corpus density, none of which
are held constant across domains. Only rank-ordering of system
configurations within a single corpus is meaningful. Concrete
consequences for this repo:

- Per-corpus evaluation results are reported separately, never
  aggregated into a single "combined" number.
- The hero-tile citation accuracy (`1.00 API / 0.14 7B self-hosted`)
  stays FastAPI-specific. It is not restated as a cross-corpus average.
- `make evaluate-fast` accepts a `--corpus` flag but has no "combined"
  mode. Anyone who wants a cross-corpus number has to run twice and
  acknowledge the incomparability in prose.
- The landing page "Key Findings" cards avoid sentences that compare
  FastAPI and K8s numbers directly.

The multi-corpus demo is a **surface feature for interactive
exploration**, not a rebenchmark. The benchmark section of the README
remains FastAPI-only and cites 27 questions on 16 docs with specific
chunker settings.
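
One way the "no combined mode" rule could be enforced at the CLI boundary is sketched below. The script behind `make evaluate-fast` is not shown in this entry, so the parser, the corpus keys, and the `format_report` helper are hypothetical; the sketch only demonstrates that restricting `--corpus` to known corpora makes a combined run impossible to request.

```python
import argparse

# Hypothetical corpus keys; a "combined" choice is deliberately absent.
CORPORA = ("fastapi", "k8s")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="evaluate-fast")
    parser.add_argument(
        "--corpus",
        choices=CORPORA,
        required=True,
        help="evaluate exactly one corpus; results are never aggregated",
    )
    return parser


def format_report(corpus: str, metrics: dict[str, float]) -> str:
    """Prefix every metric with its corpus so numbers from different
    corpora cannot be mistaken for a single comparable series."""
    return "\n".join(
        f"[{corpus}] {name}: {value:.2f}"
        for name, value in sorted(metrics.items())
    )
```

Passing `--corpus combined` makes `argparse` exit with a usage error, so a cross-corpus number can only be produced by two explicit runs.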

## K8s golden dataset uses the CRAG taxonomy

Questions in the K8s golden dataset are distributed across
categories adapted from CRAG (Yang et al., NeurIPS 2024):

- Simple fact (5-6 questions)
- Multi-hop (5-6)
- Comparison (3-4)
- Conditional (3-4)
- False-premise / unanswerable (3-4)
- Version-specific (2-3)

False-premise and version-specific questions stress the grounded
refusal mechanism. Multi-hop and comparison stress the reranker
because relevance spreads across multiple chunks. The distribution
was chosen to exercise the parts of the pipeline that the benchmark
story makes claims about, not to mimic a general-purpose QA benchmark.

The golden dataset JSON schema (v2, backward-compatible with the
FastAPI flat list) includes:

- `source_chunk_ids: list[str]` for multi-hop partial credit
  (answer must cite at least one of the expected chunks)
- `source_snippets: list[str]` for human-readable context during
  review
- `question_type: str` (CRAG taxonomy value)
- `is_multi_hop: bool` for filtered reporting
- Dataset-level header with `corpus`, `version`, `snapshot_date`,
  and pinned `chunker` parameters so the dataset is reproducible
  against a specific K8s docs snapshot

See `docs/plans/2026-04-12-multi-corpus-refactor-design.md` for the
full schema and rationale.
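
To make the partial-credit rule concrete, here is a small sketch of one v2 item and the "at least one expected chunk" check. The four field names come from the list above; the `question` field, the `GoldenItem` class name, and the `citation_hit` helper are assumptions for illustration, and the design doc remains the authority on the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenItem:
    question: str       # hypothetical field name for the question text
    question_type: str  # CRAG taxonomy value
    is_multi_hop: bool
    source_chunk_ids: list[str] = field(default_factory=list)
    source_snippets: list[str] = field(default_factory=list)


def citation_hit(item: GoldenItem, cited_chunk_ids: list[str]) -> bool:
    """Partial credit: the answer counts as grounded if it cites at
    least one of the expected chunks."""
    return bool(set(item.source_chunk_ids) & set(cited_chunk_ids))


def multi_hop_only(items: list[GoldenItem]) -> list[GoldenItem]:
    """Filtered reporting over the multi-hop subset."""
    return [item for item in items if item.is_multi_hop]
```

A multi-hop item expecting chunks `a` and `b` is credited when the answer cites `b` alone, and not credited when it cites only unrelated chunks.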

## Cold-start contingency: measure first, lazy-load if needed

Loading two corpora at startup costs memory and cold-start time. On
HF Spaces (target deployment), the realistic ceiling is 8-10 GB
resident RAM and ~60 seconds cold-start before the demo feels broken.

**Policy:**

1. Measure HF Spaces cold-start on Day 1 of deployment.
2. If cold-start is at most 60 s: plan validated, no changes.
3. If cold-start exceeds 60 s: implement a lazy-load path (FastAPI
   eager, K8s lazy on first K8s request). Scoped at roughly 2 hours
   of implementation.

This contingency is **not** pre-built. Pre-building a lazy-load path
that may never ship creates dead code that rots, and the test surface
for "lazy loading plus corpus routing plus provider switching" is
non-trivial. The RSS logging in `app.py` (Task 2) emits the exact
numbers needed to make the decision; the decision is documented here
so future-me remembers the threshold and doesn't optimize prematurely
on a hunch.
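
The measurement step could look roughly like the sketch below, which times each corpus loader and records the process's peak RSS after it returns. It is not the logging in `app.py`, which is not shown here: `measure_startup`, `needs_lazy_load`, and the loader callables are hypothetical, and the sketch assumes a Unix host because it relies on the standard-library `resource` module. Load time is only part of a Spaces cold start (container boot is not captured), so treat the result as a lower bound.

```python
import resource
import sys
import time
from typing import Callable

COLD_START_BUDGET_S = 60.0  # threshold taken from the policy above


def peak_rss_mb() -> float:
    """Peak resident set size of this process in MiB.
    ru_maxrss is reported in KiB on Linux and in bytes on macOS."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


def measure_startup(
    loaders: dict[str, Callable[[], object]]
) -> dict[str, dict[str, float]]:
    """Run each corpus loader once, recording elapsed time and the
    process-wide peak RSS observed after it finishes."""
    report: dict[str, dict[str, float]] = {}
    for name, loader in loaders.items():
        start = time.perf_counter()
        loader()
        report[name] = {
            "seconds": time.perf_counter() - start,
            "peak_rss_mb": peak_rss_mb(),
        }
    return report


def needs_lazy_load(
    report: dict[str, dict[str, float]],
    budget_s: float = COLD_START_BUDGET_S,
) -> bool:
    """True only when total load time exceeds the budget."""
    return sum(entry["seconds"] for entry in report.values()) > budget_s
```

Only if `needs_lazy_load` returns `True` on real deployment numbers does step 3 of the policy apply; the sketch intentionally contains no lazy-loading code.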