Nomearod committed · Commit 361d65d · 1 parent: 3c0089e

docs: decisions for multi-corpus refactor

Seven new entries appended to DECISIONS.md:

1. Per-corpus refusal thresholds: why a single global threshold is
wrong when FastAPI and K8s have different retrieval score
distributions.

2. Corpus × provider composition: why the nested
corpus_map[corpus][provider] structure exists and why flat would
silently break the provider toggle in multi-corpus mode. This is
the deviation from the original plan and the one architectural
call that actually matters for the demo's credibility.

3. Single parameterized system prompt β€” why one template beats two
per-corpus templates, and why the wording deliberately differs
from the typical "don't hallucinate" RAG prompt ("refuse
explicitly", "do not infer / extrapolate / general knowledge").

4. K8s curation targets recruiter-likely questions, not coverage:
policy for what's in data/k8s_docs/SOURCES.md and why
etcd-internals questions should be correctly refused.

5. No cross-corpus score comparison (BEIR principle): concrete
consequences for this repo, including keeping the hero-tile
citation accuracy FastAPI-specific and avoiding combined-mode
aggregation in evaluate-fast.

6. K8s golden dataset uses the CRAG taxonomy: distribution across
simple fact / multi-hop / comparison / conditional / false-premise /
version-specific, with the schema v2 fields that support multi-hop
partial credit.

7. Cold-start contingency: measure first, lazy-load only if HF
Spaces cold-start exceeds 60 s. The lazy-load path is deliberately
not pre-built because the test surface for
lazy-loading + corpus routing + provider switching is non-trivial.

No code changes; DECISIONS.md only. Full suite still 421 passing,
ruff clean, mypy clean.

Files changed (1)
  1. DECISIONS.md +169 -0
DECISIONS.md CHANGED
@@ -353,3 +353,172 @@ reactive framework adds a dependency, interview questions about
  "why is there a framework for 5 state variables", and indirection
  that fights the imperative SSE pattern. One `state` object + a few
  `render()` functions handles it in ~150 lines.
+
+ ## Why per-corpus refusal thresholds?
+
+ FastAPI and Kubernetes have different corpus characteristics. FastAPI
+ has 16 short, well-structured docs with sparse cross-references, so
+ relevance tends to concentrate in 1-2 chunks per query. Kubernetes
+ has 30-40 docs with heavy cross-referencing between concepts (Pod →
+ Deployment → Service → Ingress), which spreads relevance across more
+ chunks. A single global refusal threshold would either refuse too
+ aggressively on K8s (no single chunk dominates, so the top score
+ looks "low") or not aggressively enough on FastAPI (where a
+ moderate-scoring chunk might be the only hit and the query should
+ still be refused).
+
+ `CorpusConfig` carries `refusal_threshold` as a per-corpus field.
+ Each threshold gets tuned against its own golden dataset. There
+ is no "fair" shared threshold because BEIR showed that retrieval
+ scores are not comparable across corpora. Placeholder values ship in
+ `default.yaml` and are replaced by tuned values during the per-corpus
+ evaluation sweep.
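The per-corpus gate described above could look roughly like the following. This is a minimal sketch, not the repo's implementation: `CorpusConfig` and `refusal_threshold` come from the entry, while the `name` field, the `should_refuse` helper, and both numeric thresholds are hypothetical placeholders chosen only to show the mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusConfig:
    """Per-corpus settings; only `refusal_threshold` is named in the entry above."""

    name: str  # hypothetical field, used here only for readability
    refusal_threshold: float


def should_refuse(top_score: float, config: CorpusConfig) -> bool:
    """Refuse when the best retrieval score is below this corpus's own threshold."""
    return top_score < config.refusal_threshold


# Placeholder thresholds for illustration only, not tuned values.
CORPORA = {
    "fastapi": CorpusConfig(name="fastapi", refusal_threshold=0.50),
    "k8s": CorpusConfig(name="k8s", refusal_threshold=0.30),
}

# The same top score leads to different outcomes depending on the corpus.
score = 0.40
decisions = {name: should_refuse(score, cfg) for name, cfg in CORPORA.items()}
```

With these placeholder numbers, a top score of 0.40 is refused on FastAPI and answered on K8s, which is the behavior a single global threshold cannot express.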
+
+ ## Why corpus and provider toggles compose: corpus_map[corpus][provider]
+
+ The simpler design would have been `corpus_map[corpus]` returning a
+ single orchestrator. It ships in 10 fewer lines. It also silently
+ breaks the provider toggle in multi-corpus mode: the orchestrator
+ inside each corpus cell holds one fixed provider, and clicking
+ "Anthropic" in the dashboard keeps running on OpenAI.
+
+ This project's hero-tile metric is the provider comparison (`1.00 API /
+ 0.14 7B self-hosted`). Breaking the mechanism that demonstrates that
+ metric, on a portfolio demo where a reviewer will open DevTools and
+ notice, would erode the honest-evaluation brand the whole repo is
+ built around. The nested `corpus_map[corpus][provider]` structure
+ keeps both toggles functional. Store, retriever, and search tool are
+ shared across providers within a corpus (the expensive objects are
+ held once per corpus); only the orchestrator varies per provider
+ since it holds the LLM client. Per-corpus × per-provider memory
+ overhead is an orchestrator struct, not a FAISS index.
+
+ RSS is logged per corpus, not per corpus × provider, because the
+ store is what drives memory. The provider multiplier is negligible
+ compared to a hybrid index + embedder.
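The nested structure can be sketched as follows. The `Retriever` and `Orchestrator` classes, the `build_corpus_map` helper, and the corpus and provider keys are all hypothetical stand-ins; the only parts taken from the entry are the `corpus_map[corpus][provider]` shape and the rule that expensive objects are built once per corpus.

```python
from dataclasses import dataclass


@dataclass
class Retriever:
    """Stand-in for the expensive per-corpus objects (store, retriever, search tool)."""

    corpus: str


@dataclass
class Orchestrator:
    """Stand-in for the per-provider object that holds the LLM client."""

    retriever: Retriever
    provider: str


def build_corpus_map(
    corpora: list[str], providers: list[str]
) -> dict[str, dict[str, Orchestrator]]:
    """Build corpus_map[corpus][provider], sharing one retriever per corpus."""
    corpus_map: dict[str, dict[str, Orchestrator]] = {}
    for corpus in corpora:
        retriever = Retriever(corpus=corpus)  # built once per corpus
        corpus_map[corpus] = {
            provider: Orchestrator(retriever=retriever, provider=provider)
            for provider in providers
        }
    return corpus_map


corpus_map = build_corpus_map(["fastapi", "k8s"], ["openai", "anthropic"])
selected = corpus_map["k8s"]["anthropic"]  # both toggles resolve in one lookup
```

Because both orchestrators in a corpus cell point at the same retriever object, adding a provider costs one small wrapper rather than a second index.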
+
+ ## Why one parameterized system prompt, not per-corpus templates
+
+ The template is `"You are a technical documentation assistant for
+ {corpus_label}..."`. The only corpus-specific element is the label;
+ prompt content is identical across corpora: same citation format,
+ same refusal language, same grounding instructions. Having two
+ separate prompt files would invite drift: someone tweaks the FastAPI
+ prompt for a specific failure mode and forgets to update the K8s
+ version, and the demo silently answers differently on the two toggles.
+
+ The parameterization is enforced by two tests: (a)
+ `format_system_prompt("")` raises `ValueError` so an unresolved
+ `{corpus_label}` can never reach the LLM, and (b) a spy on
+ `orchestrator.run_stream` asserts FastAPI and K8s requests receive
+ different prompts with the correct label substituted.
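A minimal sketch of the guard that test (a) describes, assuming the template is a plain `str.format` string. The function name and the `ValueError` on an empty label come from the entry; the whitespace check, the error message, and the labels "FastAPI" and "Kubernetes" are assumptions. The template text after the label is elided in the source and is left elided here.

```python
# The remainder of the template is elided in the source ("...") and is not filled in.
SYSTEM_PROMPT_TEMPLATE = "You are a technical documentation assistant for {corpus_label}..."


def format_system_prompt(corpus_label: str) -> str:
    """Substitute the corpus label, rejecting empty or whitespace-only labels."""
    if not corpus_label.strip():
        raise ValueError("corpus_label must be a non-empty string")
    return SYSTEM_PROMPT_TEMPLATE.format(corpus_label=corpus_label)


# Hypothetical labels; the real values live in the corpus configuration.
fastapi_prompt = format_system_prompt("FastAPI")
k8s_prompt = format_system_prompt("Kubernetes")
```

Failing loudly on an empty label means a misconfigured corpus surfaces at prompt-build time instead of as a subtly wrong answer in the demo.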
+
+ The wording deliberately differs from the typical "don't hallucinate"
+ RAG template:
+
+ - **"refuse the question explicitly"** matches our refusal-gate
+ mechanism. "Say so politely" is soft language that models interpret
+ as "hedge and answer anyway".
+ - **"do not infer, do not extrapolate, do not draw on general
+ knowledge"** is the three-verb prohibition. "Do not fabricate" is
+ empirically easier to slip past because models distinguish
+ fabrication (making things up) from extrapolation (drawing
+ conclusions from adjacent but non-authoritative context).
+
+ ## Why Kubernetes curation targets recruiter-likely questions, not coverage
+
+ The K8s corpus targets ~30-40 pages curated around concepts a
+ technical reviewer would naturally type (Pod, Deployment, Service,
+ Ingress, ConfigMap, RBAC) plus cross-referencing overview pages that
+ stress the reranker. Cluster administration deep-dives, tutorials,
+ and kubectl reference are explicitly excluded: they add noise without
+ adding reviewer value and hurt retrieval precision when adjacent
+ content is thin on concept definitions.
+
+ `data/k8s_docs/SOURCES.md` is a version-controlled curation artifact.
+ Each ingested URL has a one-line rationale, a date pulled, and a
+ license note. This makes the corpus reproducible and documents the
+ curation reasoning for any reviewer who looks closely.
+
+ Trade-off: the corpus is not comprehensive K8s knowledge. A question
+ about etcd raft internals will be correctly refused. This is not a
+ bug: the refusal is part of the demo story, and "the system knows
+ what it doesn't know" is a feature of the grounded-refusal mechanism.
+
+ ## Why no cross-corpus score comparison (BEIR principle)
+
+ Per BEIR (Thakur et al., NeurIPS 2021), absolute retrieval scores are
+ not comparable across different corpora: score distributions depend
+ on chunk length, vocabulary overlap, and corpus density, none of which
+ are held constant across domains. Only rank-ordering of system
+ configurations within a single corpus is meaningful. Concrete
+ consequences for this repo:
+
+ - Per-corpus evaluation results are reported separately, never
+ aggregated into a single "combined" number.
+ - The hero-tile citation accuracy (`1.00 API / 0.14 7B self-hosted`)
+ stays FastAPI-specific. It is not restated as a cross-corpus average.
+ - `make evaluate-fast` accepts a `--corpus` flag but has no "combined"
+ mode. Anyone who wants a cross-corpus number has to run twice and
+ acknowledge the incomparability in prose.
+ - The landing page "Key Findings" cards avoid sentences that compare
+ FastAPI and K8s numbers directly.
+
+ The multi-corpus demo is a **surface feature for interactive
+ exploration**, not a rebenchmark. The benchmark section of the README
+ remains FastAPI-only and cites 27 questions on 16 docs with specific
+ chunker settings.
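One way to make the "no combined mode" rule hard to bypass is to encode it in the argument parser. The sketch below assumes the `evaluate-fast` target wraps a Python entry point built on `argparse`; the parser layout and the corpus keys `fastapi` and `k8s` are assumptions, and only the `--corpus` flag itself comes from the entry.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with a closed set of corpora and deliberately no "combined" choice."""
    parser = argparse.ArgumentParser(prog="evaluate-fast")
    parser.add_argument(
        "--corpus",
        choices=["fastapi", "k8s"],  # hypothetical corpus keys
        required=True,
        help="Corpus to evaluate; results are reported per corpus only.",
    )
    return parser


# A valid single-corpus invocation parses normally.
args = build_parser().parse_args(["--corpus", "k8s"])
```

Passing `--corpus combined` to this parser makes `argparse` print a usage error and exit, so a cross-corpus number cannot be produced by accident.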
+
+ ## K8s golden dataset uses the CRAG taxonomy
+
+ Questions in the K8s golden dataset are distributed across the
+ categories from CRAG (Yang et al., NeurIPS 2024):
+
+ - Simple fact (5-6 questions)
+ - Multi-hop (5-6)
+ - Comparison (3-4)
+ - Conditional (3-4)
+ - False-premise / unanswerable (3-4)
+ - Version-specific (2-3)
+
+ False-premise and version-specific questions stress the grounded
+ refusal mechanism. Multi-hop and comparison stress the reranker
+ because relevance spreads across multiple chunks. The distribution
+ was chosen to exercise the parts of the pipeline the benchmark story
+ claims, not to mimic a general-purpose QA benchmark.
+
+ The golden dataset JSON schema (v2, backward-compatible with the
+ FastAPI flat list) includes:
+
+ - `source_chunk_ids: list[str]` for multi-hop partial credit
+ (answer must cite at least one of the expected chunks)
+ - `source_snippets: list[str]` for human-readable context during
+ review
+ - `question_type: str` (CRAG taxonomy value)
+ - `is_multi_hop: bool` for filtered reporting
+ - Dataset-level header with `corpus`, `version`, `snapshot_date`,
+ and pinned `chunker` parameters so the dataset is reproducible
+ against a specific K8s docs snapshot
+
+ See `docs/plans/2026-04-12-multi-corpus-refactor-design.md` for the
+ full schema and rationale.
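The partial-credit rule for multi-hop questions can be sketched as a set intersection over chunk IDs. The four field names below are taken from the schema list above; the `question` field, the `cites_expected_chunk` helper, the chunk ID format, and the example item are hypothetical and are not drawn from the real dataset.

```python
from dataclasses import dataclass


@dataclass
class GoldenQuestion:
    """One v2 dataset item, reduced to the fields listed in the entry above."""

    question: str  # hypothetical field name
    source_chunk_ids: list[str]
    source_snippets: list[str]
    question_type: str
    is_multi_hop: bool


def cites_expected_chunk(cited_chunk_ids: list[str], item: GoldenQuestion) -> bool:
    """Partial-credit rule: at least one cited chunk is among the expected ones."""
    return bool(set(cited_chunk_ids) & set(item.source_chunk_ids))


# Illustrative item only; IDs and snippets are made up for the example.
item = GoldenQuestion(
    question="How does a Service route traffic to Pods managed by a Deployment?",
    source_chunk_ids=["service#2", "deployment#0"],
    source_snippets=["A Service selects Pods by label...", "A Deployment manages..."],
    question_type="multi-hop",
    is_multi_hop=True,
)

partial = cites_expected_chunk(["service#2", "ingress#1"], item)
missed = cites_expected_chunk(["ingress#1"], item)
```

Here an answer that cites one of the two expected chunks earns credit, while an answer that cites only an unrelated chunk does not.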
+
+ ## Cold-start contingency: measure first, lazy-load if needed
+
+ Loading two corpora at startup costs memory and cold-start time. On
+ HF Spaces (target deployment), the realistic ceiling is 8-10 GB
+ resident RAM and ~60 seconds cold-start before the demo feels broken.
+
+ **Policy:**
+
+ 1. Measure HF Spaces cold-start on Day 1 of deployment.
+ 2. If cold-start < 60 s: plan validated, no changes.
+ 3. If cold-start > 60 s: implement a lazy-load path (FastAPI eager,
+ K8s lazy on first K8s request). Scoped ~2 hours implementation.
+
+ This contingency is **not** pre-built. Pre-building a lazy-load path
+ that may never ship creates dead code that rots, and the test surface
+ for "lazy loading plus corpus routing plus provider switching" is
+ non-trivial. The RSS logging in `app.py` (Task 2) emits the exact
+ numbers needed to make the decision; the decision is documented here
+ so future-me remembers the threshold and doesn't optimize prematurely
+ on a hunch.
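The policy above reduces to one measurement and one comparison. The sketch below assumes the corpus loading step can be wrapped in a single callable; `measure_cold_start`, `needs_lazy_load`, and the `load_all` parameter are hypothetical names, and in-process timing captures only the loading portion of a real HF Spaces cold start, not container boot.

```python
import time
from typing import Callable

COLD_START_LIMIT_S = 60.0  # threshold taken from the policy above


def measure_cold_start(load_all: Callable[[], None]) -> float:
    """Time a single run of the startup loading step, in seconds."""
    start = time.perf_counter()
    load_all()
    return time.perf_counter() - start


def needs_lazy_load(cold_start_seconds: float, limit_s: float = COLD_START_LIMIT_S) -> bool:
    """Return True only when the measured cold-start exceeds the limit."""
    return cold_start_seconds > limit_s


# A no-op loader stands in for the real corpus loading here.
elapsed = measure_cold_start(lambda: None)
decision = needs_lazy_load(elapsed)
```

Keeping the threshold in one named constant makes the later decision a matter of reading a logged number against it rather than re-deriving the limit.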