# Design Notes
## Key decisions and tradeoffs
### API target: own implementation
Instead of wrapping a third-party fake API, the client wraps this project's own
FastAPI backend. This means the client and the API are co-designed: the typed
models on both sides stay in sync by design. The tradeoff: less realistic than
wrapping an external API you don't control, but the test surface is richer and
the integration tests verify real business logic, not just HTTP plumbing.
### Two-layer evaluation (L1 live / L2 batch)
L1 runs on every query inline (~1-2s overhead). L2 runs offline against a golden
dataset. The split is a deliberate latency/depth tradeoff: LLM-judged metrics
(contextual precision, reverse-question relevancy) add 30+ seconds per pair, which is
unacceptable live but fine in batch. The golden dataset is the contract; L2 is the
regression gate.
### Deterministic chain_terminology over LLM judge
The terminology check is a dict lookup, not a model call. Zero latency, zero cost,
zero false negatives on known mappings. The tradeoff: it only catches terms in the
catalog; novel terminology drift goes undetected. An LLM judge would catch drift
but would introduce latency and non-determinism into a metric that must be auditable.
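
A minimal sketch of what a catalog-based check could look like. The catalog entries, the direction of the mapping (flagged term to canonical term), and the return shape are assumptions for illustration; only the name `check_terminology` comes from these notes, and its real signature may differ.

```python
import re

# Hypothetical catalog: flagged term -> canonical term (illustrative entries only).
TERM_CATALOG = {
    "item registry": "product catalog",
    "stock ledger": "inventory log",
}


def check_terminology(text, catalog=TERM_CATALOG):
    """Return (flagged_term, canonical_term) pairs for catalog terms found in text.

    Purely deterministic: no model call, so the same input always yields
    the same result. Terms missing from the catalog are never reported.
    """
    lowered = text.lower()
    hits = []
    for term, canonical in catalog.items():
        # Word boundaries avoid matching a term inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, canonical))
    return hits
```

Because the result depends only on the input text and the catalog, every flagged term can be traced back to a specific catalog entry during an audit.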
### In-memory retrieval over vector database
KB size is 8-9 docs per domain. Encoding them at startup and doing cosine search
at query time adds ~2ms retrieval overhead with no infrastructure dependency.
A vector DB (Chroma, pgvector) would add operational complexity with zero
retrieval quality gain at this scale.
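
A rough sketch of the encode-at-startup, cosine-at-query-time pattern described above. `toy_embed` is a stand-in bag-of-words embedder so the example runs without a model; the real pipeline uses a sentence embedder, and the function names here are hypothetical.

```python
import math

VOCAB = ["refund", "shipping", "invoice"]  # toy vocabulary, illustration only


def toy_embed(text):
    """Stand-in embedder: word counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs, embed):
    """Encode every document once (e.g. at startup) and keep vectors in memory."""
    return [(doc, embed(doc)) for doc in docs]


def search(index, query_vec, top_k=3):
    """Brute-force scan: score every document, return the top_k (score, doc) pairs."""
    scored = [(cosine(query_vec, vec), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

With fewer than ten documents per domain, the linear scan in `search` touches only a handful of vectors per query, which is why an approximate-nearest-neighbour index buys nothing here.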
### httpx + tenacity for the client
`httpx` is the modern alternative to `requests`: native async support if needed
later, cleaner timeout API, better type annotations. `tenacity` separates retry
policy from request logic cleanly: the retry decorator is readable and testable
independently from the HTTP code.
### Integration tests are read-only by design
The API has no mutable state: queries don't persist, no records are created or
deleted. Cleanup is therefore trivially satisfied: there is nothing to clean up.
This is called out explicitly because it's a deliberate architectural choice, not
an oversight. A stateful API (task creation, deletion) would require explicit
teardown fixtures.
---
## NLI model selection: what was tried and why
The faithfulness grader went through three models before converging:
**Vectara HHEM v2** (`vectara/hallucination_evaluation_model`): purpose-built for RAG
faithfulness, not general NLI. The correct model for this task. Unusable: the checkpoint
is missing `t5.transformer.encoder.embed_tokens.weight`. The embedding matrix is
zero-initialized (`std=0.0`), producing constant 0.502 probability for every input.
Diagnosed via weight inspection, not error message.
**`cross-encoder/nli-deberta-v3-small`** (first attempt, paragraph-level): 3-class NLI
(contradiction / entailment / neutral). Correct model family, wrong input format.
NLI cross-encoders are trained on sentence-pair inputs (SNLI/MNLI). Feeding a 3–4
sentence KB paragraph as the premise causes entailment scores to collapse: verbatim
text scores `ent=0.002` and is treated as neutral. Root cause: the model distributes
probability across longer sequences in ways not seen during training.
**`cross-encoder/nli-deberta-v3-small` (sentence-level)**: same model, fixed by splitting
KB chunks into individual sentences before scoring. Verbatim: `ent=0.995`. Aliased terms
("item registry" vs "product catalog (item registry)"): `ent=0.989`. Hallucinated facts:
`ent≈0.000`, contradiction≈1.0. This is the current implementation.
**Key insight:** the NLI model selection problem is a data format problem as much as a
model selection problem. The same model produces correct results at sentence level and
degenerate results at paragraph level.
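
The sentence-level fix can be sketched as follows. The regex splitter, the max aggregation over KB sentences, and the function names are assumptions for illustration; `entail_fn` stands in for whatever returns the cross-encoder's entailment probability for a (premise, hypothesis) pair, and `toy_entail` is only a placeholder so the sketch runs without a model.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def max_entailment(kb_chunk, claim, entail_fn):
    """Score the claim against each KB sentence separately.

    Returns (best_score, best_sentence). The premise passed to entail_fn
    is always a single sentence, matching the sentence-pair format the
    NLI model was trained on.
    """
    best_score, best_sentence = 0.0, None
    for sentence in split_sentences(kb_chunk):
        score = entail_fn(sentence, claim)
        if score > best_score:
            best_score, best_sentence = score, sentence
    return best_score, best_sentence


def toy_entail(premise, hypothesis):
    """Placeholder scorer: 1.0 on exact (case-insensitive) match, else 0.0."""
    return 1.0 if premise.strip().lower() == hypothesis.strip().lower() else 0.0
```

Returning the best-matching sentence alongside the score also gives a cheap form of evidence for debugging: it shows which KB sentence a claim was matched against.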
---
## Alternative judge approaches considered
### Ollama (local LLM judge)
Ollama can run Llama 3 / Mistral locally, making it a zero-cost alternative to
HF Inference API for both generation and LLM-as-judge evaluation. Tradeoffs:
requires local GPU or accepts slower CPU inference; no external API rate limits;
outputs are fully reproducible since the model version is pinned. For the
faithfulness judge specifically, a local `llama3` via Ollama would remove the
dependency on HF token entirely and allow offline eval runs.
### Prometheus (LLM eval framework)
[Prometheus-2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) is a
7B model fine-tuned specifically for evaluation tasks: it outputs a score + rationale
in a structured format designed for rubric-based grading. It's a drop-in replacement
for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
The tradeoff vs. the current sentence-level NLI approach: Prometheus is slower (7B vs
purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
which is more interpretable for audit and debugging.
**Why not used here:** the cross-encoder NLI approach runs faster and requires no prompt
engineering. Prometheus would be the right choice if rationale logging is a compliance requirement.
---
## What another 4 hours would add
- **L2 LLM metrics in `eval/metrics.py`**: contextual precision (chunk ranking),
contextual recall (coverage), and answer correctness against full reference answers.
Currently only keyphrase coverage is used as a proxy.
- **Async client**: `httpx.AsyncClient` variant for high-concurrency load testing.
- **Property-based tests**: `hypothesis` to fuzz `check_terminology` and graders
with generated strings, catching edge cases the golden dataset doesn't cover.
- **CI pipeline**: GitHub Actions running `make lint`, `make type-check`,
`make test` on every PR. Integration tests gated on a self-hosted runner with
the API running.
- **Threshold calibration report**: `eval/calibrate.py` exists and runs graders
against golden-dataset expected answers, so threshold calibration is now a single
command, not a missing feature. Actual threshold adjustments require reviewing
the output against real query distributions.
## Gate 5 audit gaps addressed
- **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
enough information" responses and returns score=1.0: no factual claims, trivially faithful.
- **Partial grounding blind spot**: faithfulness now uses claim-level decomposition
(`grade_faithfulness_decomposed`). Response split into sentences; each verified
independently. Score = supported_claims / total_claims. A response with one hallucinated
sentence in three now scores 0.667, not 1.0.
- **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING
log entry and sets `flagged: true` in the response payload. UI shows a red banner.
- **Cold-start latency**: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
- **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query, rival-term
prompt injection, multi-doc synthesis, hallucination bait).
- **No drift detection**: added `eval/drift.py`, a KS two-sample test per metric that compares
live telemetry scores against the golden-dataset baseline. Detects faithfulness degradation
at p < 0.05 when roughly 40% of traffic is degraded across 40+ events.
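
The refusal short-circuit and claim-level scoring from the first two items above can be sketched roughly as below. This is not the code behind `_is_refusal()` or `grade_faithfulness_decomposed`: the names `is_refusal` and `faithfulness_score`, the marker list, the regex sentence splitter, and the `is_supported` callback are hypothetical stand-ins, and treating an empty response as trivially faithful is an assumption.

```python
import re

# Hypothetical marker list; the real detector may match more phrasings.
REFUSAL_MARKERS = ("i don't have enough information",)


def is_refusal(response):
    """True if the response contains a known refusal phrase."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def faithfulness_score(response, is_supported):
    """Score = supported_claims / total_claims, with one claim per sentence.

    is_supported is a callback (claim -> bool), e.g. an NLI check against
    the retrieved KB sentences.
    """
    if is_refusal(response):
        return 1.0  # no factual claims, so nothing can be unfaithful
    claims = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not claims:
        return 1.0  # assumption: an empty response makes no claims
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)
```

For a three-sentence response where the callback rejects one sentence, this returns 2/3, which matches the 0.667 figure quoted above.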
---
## Where LLM assistance helped and where it misled
**Helped:**
- Scaffolding the full project structure (backend, client, tests, config) in a
single session without losing consistency across files.
- Writing the faithfulness prompt in a way that reliably returns structured JSON:
the few-shot JSON format in the prompt was a suggested pattern that works.
- Catching that `except Exception` in the faithfulness grader was too broad and
replacing it with `(json.JSONDecodeError, anthropic.APIError)`.
- Identifying that `_build_index_by_domain` was defined twice in `pipeline.py`
(duplicate introduced during an edit session), caught during code review.
**Misled or required correction:**
- Initially used `lru_cache` on a function that takes a `SentenceTransformer`
instance as an argument: unhashable, so the cache silently failed. Required
switching to a module-level dict cache.
- Generated a dead loop in `rosetta.py` (iterating over terms with `continue`
but no code after the continue branch) that did nothing. The logic existed in
a comment describing intent but was never implemented. Caught in review.
- Suggested a fictional client name that conflicted with a real company.
Required renaming before the repo went public.