# Design Notes

## Key decisions and tradeoffs

### API target: own implementation

Instead of wrapping a third-party fake API, the client wraps this project's own
FastAPI backend. This means the client and the API are co-designed: the typed
models on both sides stay in sync by design. The tradeoff: less realistic than
wrapping an external API you don't control, but the test surface is richer and
the integration tests verify real business logic, not just HTTP plumbing.

### Two-layer evaluation (L1 live / L2 batch)

L1 runs on every query inline (~1-2s overhead). L2 runs offline against a golden
dataset. The split is a deliberate latency/depth tradeoff: LLM-judged metrics
(contextual precision, reverse-question relevancy) add 30+ seconds per pair,
which is unacceptable live but fine in batch. The golden dataset is the contract;
L2 is the regression gate.

### Deterministic chain_terminology over LLM judge

The terminology check is a dict lookup, not a model call. Zero latency, zero cost,
zero false negatives on known mappings. The tradeoff: it only catches terms in the
catalog, so novel terminology drift goes undetected. An LLM judge would catch drift
but would introduce latency and non-determinism into a metric that must be auditable.
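
A minimal sketch of the shape of that check. The `TERM_CATALOG` name and the exact
signature of `check_terminology` are assumptions here; the real mapping lives in the
project's own module:

```python
# Illustrative only: TERM_CATALOG and this signature are assumptions.
TERM_CATALOG: dict[str, str] = {
    "item registry": "product catalog",  # rival term -> approved term
    "sku vault": "product catalog",
}

def check_terminology(response: str) -> list[str]:
    """Return rival terms found in a response. Pure string work:
    deterministic, zero cost, auditable."""
    lowered = response.lower()
    return [term for term in TERM_CATALOG if term in lowered]
```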

### In-memory retrieval over vector database

KB size is 8-9 docs per domain. Encoding them at startup and doing cosine search
at query time adds ~2ms retrieval overhead with no infrastructure dependency.
A vector DB (Chroma, pgvector) would add operational complexity with zero
retrieval quality gain at this scale.
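
A sketch of the startup-encode / query-time cosine pattern, assuming
`sentence-transformers`; the doc list and model name are placeholders rather than
the project's actual values:

```python
# Sketch only: KB_DOCS and the model choice are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

KB_DOCS = ["Doc one text ...", "Doc two text ..."]  # 8-9 docs per domain

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(KB_DOCS, normalize_embeddings=True)  # encoded once, at startup

def retrieve(query: str, k: int = 3) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since the vectors are unit-norm
    return [KB_DOCS[i] for i in np.argsort(scores)[::-1][:k]]
```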

### httpx + tenacity for the client

`httpx` is the modern alternative to `requests`: native async support if needed
later, a cleaner timeout API, and better type annotations. `tenacity` cleanly
separates retry policy from request logic: the retry decorator is readable and
testable independently of the HTTP code.
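
A sketch of that separation; the specific policy (attempt count, backoff, retried
exception types) is an assumption, not the project's actual config:

```python
# Sketch only: the retry policy below is illustrative.
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(httpx.TransportError),  # timeouts, connect errors
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=8),
)
def get_json(client: httpx.Client, url: str) -> dict:
    resp = client.get(url, timeout=5.0)
    resp.raise_for_status()
    return resp.json()
```

Because the policy lives entirely in the decorator, it can be exercised in a unit
test by stubbing a callable that raises `httpx.TransportError`, without any HTTP
traffic.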

### Integration tests are read-only by design

The API has no mutable state: queries don't persist, and no records are created or
deleted. Cleanup is therefore trivially satisfied; there is nothing to clean up.
This is called out explicitly because it's a deliberate architectural choice, not
an oversight. A stateful API (task creation, deletion) would require explicit
teardown fixtures.

---

## NLI model selection: what was tried and why

The faithfulness grader went through three models before converging.

**Vectara HHEM v2** (`vectara/hallucination_evaluation_model`): purpose-built for RAG
faithfulness rather than general NLI, and the correct model for this task. Unusable in
practice: the checkpoint is missing `t5.transformer.encoder.embed_tokens.weight`, so the
embedding matrix is zero-initialized (`std=0.0`) and the model produces a constant 0.502
probability for every input. Diagnosed via weight inspection, not an error message.

**`cross-encoder/nli-deberta-v3-small`** (first attempt, paragraph-level): 3-class NLI
(contradiction / entailment / neutral). Correct model family, wrong input format. NLI
cross-encoders are trained on sentence-pair inputs (SNLI/MNLI). Feeding a 3-4 sentence
KB paragraph as the premise causes entailment scores to collapse: verbatim text scores
`ent=0.002` and is treated as neutral. Root cause: the model distributes probability
mass across longer sequences in ways not seen during training.

**`cross-encoder/nli-deberta-v3-small` (sentence-level)**: same model, fixed by splitting
KB chunks into individual sentences before scoring. Verbatim text: `ent=0.995`. Aliased
terms ("item registry" vs "product catalog (item registry)"): `ent=0.989`. Hallucinated
facts: `ent≈0.000`, `contradiction≈1.0`. This is the current implementation.

**Key insight:** the NLI model selection problem is as much a data format problem as a
model selection problem. The same model produces correct results at sentence level and
degenerate results at paragraph level.
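
A sketch of the sentence-level scoring loop. The 3-class output order (contradiction,
entailment, neutral) follows the model card; the aggregation and names here are
illustrative, not the project's exact code:

```python
# Sketch only: the max-over-sentences aggregation is illustrative.
import numpy as np
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-small")

def max_entailment(premise_sentences: list[str], claim: str) -> float:
    """Max entailment probability of one claim against each KB sentence."""
    pairs = [(sent, claim) for sent in premise_sentences]
    logits = nli.predict(pairs)  # raw logits, shape (n_pairs, 3)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(probs[:, 1].max())  # index 1 = entailment per the model card
```

Scoring each claim against individual sentences rather than whole paragraphs is
exactly the fix described above: the premise stays within the sentence-pair regime
the model was trained on.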

---

## Alternative judge approaches considered

### Ollama (local LLM judge)

Ollama can run Llama 3 / Mistral locally, making it a zero-cost alternative to the
HF Inference API for both generation and LLM-as-judge evaluation. The tradeoff:
it requires a local GPU (or accepting slower CPU inference); in exchange there are
no external API rate limits, and outputs are reproducible because the model version
is pinned. For the faithfulness judge specifically, a local `llama3` via Ollama would
remove the dependency on the HF token entirely and allow offline eval runs.
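
If this route were taken, the judge would be a single HTTP call against Ollama's
local REST API (default port 11434). The prompt and parsing below are illustrative;
the project does not currently ship this:

```python
# Sketch only: prompt wording and output handling are illustrative.
import httpx

def ollama_judge(question: str, answer: str, context: str) -> str:
    prompt = (
        "Given the context, is the answer faithful to it? "
        "Reply with exactly FAITHFUL or UNFAITHFUL.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer: {answer}"
    )
    resp = httpx.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```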

### Prometheus (LLM eval framework)

[Prometheus-2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) is a
7B model fine-tuned specifically for evaluation tasks: it outputs a score plus a
rationale in a structured format designed for rubric-based grading. It's a drop-in
replacement for GPT-4/Claude as an eval judge, runs via Ollama or HF Inference, and
is purpose-built for the kind of faithfulness and relevancy scoring done in
`eval/metrics.py`. The tradeoff vs. the current sentence-level NLI approach:
Prometheus is slower (a 7B generative model vs. a small purpose-built cross-encoder)
but produces a human-readable rationale alongside the score, which is more
interpretable for audit and debugging.

**Why not used here:** the cross-encoder NLI approach runs faster and requires no
prompt engineering. Prometheus would be the right choice if rationale logging were a
compliance requirement.

---

## What another 4 hours would add

- **`eval/metrics.py` (L2 LLM metrics)**: contextual precision (chunk ranking),
  contextual recall (coverage), and answer correctness against full reference
  answers. Currently only keyphrase coverage is used as a proxy.
- **Async client**: an `httpx.AsyncClient` variant for high-concurrency load testing.
- **Property-based tests**: `hypothesis` to fuzz `check_terminology` and the graders
  with generated strings, catching edge cases the golden dataset doesn't cover (see
  the sketch after this list).
- **CI pipeline**: GitHub Actions running `make lint`, `make type-check`, and
  `make test` on every PR. Integration tests gated on a self-hosted runner with
  the API running.
- **Threshold calibration report**: `eval/calibrate.py` exists and runs the graders
  against golden-dataset expected answers, so threshold calibration is now a single
  command rather than a missing feature. Actual threshold adjustments still require
  reviewing the output against real query distributions.
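
For the property-based item above, a minimal sketch of what a `hypothesis` fuzz test
could look like, reusing the `check_terminology` / `TERM_CATALOG` shape from the
terminology sketch earlier; the invariants are assumptions about what the grader
should guarantee:

```python
# Sketch only: the invariants below are assumed, not taken from the test suite.
from hypothesis import given, strategies as st

@given(st.text())
def test_check_terminology_is_total(s: str) -> None:
    found = check_terminology(s)  # must never raise on arbitrary input
    assert all(term in TERM_CATALOG for term in found)  # only cataloged terms
```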

## Gate 5 audit gaps addressed

- **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
  enough information" responses and returns score=1.0; a refusal makes no factual
  claims and is trivially faithful.
- **Partial grounding blind spot**: faithfulness now uses claim-level decomposition
  (`grade_faithfulness_decomposed`). The response is split into sentences and each is
  verified independently; score = supported_claims / total_claims. A response with
  one hallucinated sentence in three now scores 0.667, not 1.0.
- **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL`
  WARNING log entry and sets `flagged: true` in the response payload. The UI shows a
  red banner.
- **Cold-start latency**: the embedder and NLI model are pre-warmed at startup in the
  FastAPI lifespan.
- **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query,
  rival-term prompt injection, multi-doc synthesis, hallucination bait).
- **No drift detection**: added `eval/drift.py`, a KS two-sample test per metric that
  compares live telemetry scores against the golden-dataset baseline (sketched
  below). It detects faithfulness degradation at p < 0.05 once roughly 40% of traffic
  has degraded, given 40+ live events.
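
A sketch of the per-metric KS comparison, assuming `scipy`; the function name, alpha,
and minimum-event gate are illustrative of `eval/drift.py`, not copied from it:

```python
# Sketch only: names and thresholds are illustrative.
from scipy.stats import ks_2samp

def drifted(live_scores: list[float], baseline_scores: list[float],
            alpha: float = 0.05, min_events: int = 40) -> bool:
    """Flag drift when live metric scores diverge from the golden baseline."""
    if len(live_scores) < min_events:
        return False  # not enough live telemetry to test yet
    _stat, p_value = ks_2samp(live_scores, baseline_scores)
    return p_value < alpha
```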

---

## Where LLM assistance helped and where it misled

**Helped:**

- Scaffolding the full project structure (backend, client, tests, config) in a
  single session without losing consistency across files.
- Writing the faithfulness prompt in a way that reliably returns structured JSON;
  the few-shot JSON format in the prompt was a suggested pattern that works.
- Catching that `except Exception` in the faithfulness grader was too broad and
  replacing it with `(json.JSONDecodeError, anthropic.APIError)`.
- Identifying that `_build_index_by_domain` was defined twice in pipeline.py
  (a duplicate introduced during an edit session), caught during code review.

**Misled or required correction:**

- Initially used `lru_cache` on a function that takes a `SentenceTransformer`
  instance as an argument; the instance is unhashable, so the cache silently failed.
  Required switching to a module-level dict cache (see the sketch after this list).
- Generated a dead loop in `rosetta.py` that iterated over terms and hit `continue`
  with no code after the `continue` branch, so it did nothing. The logic existed in
  a comment describing intent but was never implemented. Caught in review.
- Suggested a fictional client name that conflicted with a real company. Required
  renaming before the repo went public.
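
For the cache item above, a minimal sketch of the module-level dict fix, with
illustrative names; keying on the text alone keeps the model instance out of the
cache key entirely:

```python
# Sketch only: names are illustrative.
import numpy as np

_EMBED_CACHE: dict[str, np.ndarray] = {}

def embed_cached(model, text: str) -> np.ndarray:
    """Cache keyed on the text only, so the model never enters the key."""
    if text not in _EMBED_CACHE:
        _EMBED_CACHE[text] = model.encode([text])[0]
    return _EMBED_CACHE[text]
```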