
Design Notes

Key decisions and tradeoffs

API target: own implementation

Instead of wrapping a third-party fake API, the client wraps this project's own FastAPI backend. Client and API are co-designed, so the typed models on both sides stay in sync by construction. The tradeoff: this is less realistic than wrapping an external API you don't control, but the test surface is richer and the integration tests verify real business logic, not just HTTP plumbing.

Two-layer evaluation (L1 live / L2 batch)

L1 runs on every query inline (~1-2s overhead). L2 runs offline against a golden dataset. The split is a deliberate latency/depth tradeoff: LLM-judged metrics (contextual precision, reverse-question relevancy) add 30+ seconds per pair, which is unacceptable live but fine in batch. The golden dataset is the contract; L2 is the regression gate.

Deterministic chain_terminology over LLM judge

The terminology check is a dict lookup, not a model call. Zero latency, zero cost, zero false negatives on known mappings. The tradeoff: it only catches terms in the catalog, so novel terminology drift goes undetected. An LLM judge would catch drift but would introduce latency and non-determinism into a metric that must be auditable.
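
A minimal sketch of the shape of such a check. The catalog entries here are illustrative, not the project's actual mappings; the real check_terminology lives in the project code.

```python
# Hypothetical catalog: alias -> canonical term. Illustrative entries only.
TERM_CATALOG = {
    "item registry": "product catalog",
    "client record": "customer account",
}

def check_terminology(text: str) -> list[tuple[str, str]]:
    """Return (found_alias, canonical_term) pairs for every catalog hit.
    Pure dict/substring lookup: no model call, fully deterministic."""
    lowered = text.lower()
    return [(alias, canonical)
            for alias, canonical in TERM_CATALOG.items()
            if alias in lowered]
```

Because the check is a pure function of its input, the same text always yields the same result, which is what makes the metric auditable.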

In-memory retrieval over vector database

KB size is 8-9 docs per domain. Encoding them at startup and doing cosine search at query time adds ~2ms retrieval overhead with no infrastructure dependency. A vector DB (Chroma, pgvector) would add operational complexity with zero retrieval quality gain at this scale.
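
At this scale the search is simple enough to sketch end to end. Plain Python lists stand in here for the embeddings a SentenceTransformer would produce at startup; the vector values are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Brute-force cosine search over the handful of KB docs encoded at startup."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

With 8-9 docs per domain, brute force is both the simplest and the fastest option; an index structure only pays off at orders of magnitude more documents.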

httpx + tenacity for the client

httpx is the modern alternative to requests: native async support if needed later, a cleaner timeout API, and better type annotations. tenacity cleanly separates retry policy from request logic; the retry decorator is readable and testable independently of the HTTP code.
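
The separation can be illustrated with a minimal stdlib stand-in for the pattern; this is not tenacity's actual API (tenacity's decorator adds wait strategies, stop conditions, and logging), just the shape it enables.

```python
import time
from functools import wraps

def retry(attempts: int = 3, backoff: float = 0.5, exceptions=(Exception,)):
    """Minimal stand-in for a tenacity-style @retry: the policy lives in the
    decorator, independent of the request code it wraps."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # retries exhausted, surface the error
                    time.sleep(backoff * attempt)  # linear backoff
        return wrapper
    return decorator
```

The HTTP call itself stays a plain function; the retry behavior can be unit-tested with any flaky callable, no HTTP involved.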

Integration tests are read-only by design

The API has no mutable state: queries don't persist, no records are created or deleted. Cleanup is therefore trivially satisfied β€” there is nothing to clean up. This is called out explicitly because it's a deliberate architectural choice, not an oversight. A stateful API (task creation, deletion) would require explicit teardown fixtures.


NLI model selection β€” what was tried and why

The faithfulness grader went through three models before converging:

Vectara HHEM v2 (vectara/hallucination_evaluation_model): purpose-built for RAG faithfulness rather than general NLI, and on paper the correct model for this task. Unusable in practice: the checkpoint is missing t5.transformer.encoder.embed_tokens.weight, so the embedding matrix is zero-initialized (std=0.0) and the model produces a constant 0.502 probability for every input. Diagnosed via weight inspection, not an error message.

cross-encoder/nli-deberta-v3-small (first attempt, paragraph-level): 3-class NLI (contradiction / entailment / neutral). Correct model family, wrong input format. NLI cross-encoders are trained on sentence-pair inputs (SNLI/MNLI); feeding a 3-4 sentence KB paragraph as the premise causes entailment scores to collapse, with verbatim text scoring ent=0.002 and being treated as neutral. Root cause: the model distributes probability across longer sequences in ways not seen during training.

cross-encoder/nli-deberta-v3-small (sentence-level): same model, fixed by splitting KB chunks into individual sentences before scoring. Verbatim text: ent=0.995. Aliased terms ("item registry" vs "product catalog (item registry)"): ent=0.989. Hallucinated facts: ent≈0.000, contradiction≈1.0. This is the current implementation.

Key insight: NLI model selection is as much a data-format problem as a model-choice problem. The same model produces correct results at sentence level and degenerate results at paragraph level.
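
The sentence-level fix reduces to a small amount of glue. Sketched here with a stub in place of the cross-encoder call; the sentence splitter is naive and illustrative, and score_pair stands in for the model's per-pair entailment score.

```python
import re
from typing import Callable

def split_sentences(text: str) -> list[str]:
    # Naive splitter for illustration; production code may want a real tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def max_entailment(claim: str, kb_chunk: str,
                   score_pair: Callable[[str, str], float]) -> float:
    """Score the claim against each KB sentence individually and keep the
    best entailment, instead of feeding the whole paragraph as premise."""
    return max(score_pair(premise, claim) for premise in split_sentences(kb_chunk))
```

The model call itself is unchanged; only the premise granularity moves from paragraph to sentence, which is exactly the difference between ent=0.002 and ent=0.995 on verbatim text.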


Alternative judge approaches considered

Ollama (local LLM judge)

Ollama can run Llama 3 / Mistral locally, making it a zero-cost alternative to the HF Inference API for both generation and LLM-as-judge evaluation. Tradeoffs: it requires a local GPU (or tolerating slower CPU inference); in exchange there are no external API rate limits, and outputs are reproducible because the model version is pinned. For the faithfulness judge specifically, a local llama3 via Ollama would remove the dependency on an HF token entirely and allow offline eval runs.

Prometheus (LLM eval framework)

Prometheus-2 is a 7B model fine-tuned specifically for evaluation tasks: it outputs a score plus a rationale in a structured format designed for rubric-based grading. It's a drop-in replacement for GPT-4/Claude as an eval judge, runs via Ollama or HF Inference, and is purpose-built for the kind of faithfulness and relevancy scoring done in eval/metrics.py. The tradeoff vs. the current sentence-level NLI approach: Prometheus is slower (a 7B generative model vs. a small purpose-built cross-encoder) but produces a human-readable rationale alongside the score, which is more interpretable for audit and debugging.

Why not used here: the cross-encoder NLI approach runs faster and requires no prompt engineering. Prometheus would be the right choice if rationale logging is a compliance requirement.


What another 4 hours would add

  • eval/metrics.py: the L2 LLM metrics, i.e. contextual precision (chunk ranking), contextual recall (coverage), and answer correctness against full reference answers. Currently only keyphrase coverage is used as a proxy.
  • Async client: httpx.AsyncClient variant for high-concurrency load testing.
  • Property-based tests: hypothesis to fuzz check_terminology and the graders with generated strings, catching edge cases the golden dataset doesn't cover.
  • CI pipeline: GitHub Actions running make lint, make type-check, make test on every PR. Integration tests gated on a self-hosted runner with the API running.
  • Threshold calibration report: eval/calibrate.py exists and runs the graders against golden-dataset expected answers, so threshold calibration is now a single command rather than a missing feature. Actual threshold adjustments still require reviewing the output against real query distributions.

Gate 5 audit gaps addressed

  • Faithfulness false negatives on refusals: _is_refusal() detects "I don't have enough information" responses and returns score=1.0; a refusal makes no factual claims and is trivially faithful.
  • Partial grounding blind spot: faithfulness now uses claim-level decomposition (grade_faithfulness_decomposed). Response split into sentences; each verified independently. Score = supported_claims / total_claims. A response with one hallucinated sentence in three now scores 0.667, not 1.0.
  • No escalation path: overall_pass=False now emits a structured EVAL_FAIL WARNING log entry and sets flagged: true in the response payload. UI shows a red banner.
  • Cold-start latency: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
  • Happy-path-only golden dataset: 4 adversarial pairs added (vague query, rival-term prompt injection, multi-doc synthesis, hallucination bait).
  • No drift detection: added eval/drift.py, a KS two-sample test per metric that compares live telemetry scores against the golden-dataset baseline. It detects faithfulness degradation at p < 0.05 once roughly 40% of traffic has degraded, given 40+ telemetry events.
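
The refusal short-circuit and claim-level decomposition combine into a small grader shape. In this sketch, is_supported stands in for the per-sentence NLI check, the refusal markers are illustrative, and the sentence splitter is deliberately naive.

```python
import re
from typing import Callable

REFUSAL_MARKERS = ("i don't have enough information",)  # illustrative, not exhaustive

def _is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def grade_faithfulness_decomposed(response: str,
                                  is_supported: Callable[[str], bool]) -> float:
    """Fraction of sentence-level claims the KB supports; refusals make
    no factual claims and score 1.0."""
    if _is_refusal(response):
        return 1.0
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not claims:
        return 1.0  # nothing asserted, trivially faithful
    return sum(1 for c in claims if is_supported(c)) / len(claims)
```

This is what turns a three-sentence response with one hallucinated sentence into a 0.667 instead of a pass/fail coin flip.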
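
The KS check itself needs no heavy dependency (scipy's ks_2samp is the obvious production choice). A self-contained sketch using the asymptotic 5% critical value, with function names hypothetical:

```python
import math
from bisect import bisect_right

def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the
    empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(abs(bisect_right(a, t) / len(a) - bisect_right(b, t) / len(b))
               for t in points)

def drifted(baseline: list[float], live: list[float]) -> bool:
    """Reject 'same distribution' at roughly p < 0.05 using the asymptotic
    critical value D > 1.36 * sqrt((n + m) / (n * m))."""
    n, m = len(baseline), len(live)
    return ks_statistic(baseline, live) > 1.36 * math.sqrt((n + m) / (n * m))
```

Run per metric: baseline scores come from the golden dataset, live scores from telemetry, and a True result triggers the drift alert.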

Where LLM assistance helped and where it misled

Helped:

  • Scaffolding the full project structure (backend, client, tests, config) in a single session without losing consistency across files.
  • Writing the faithfulness prompt in a way that reliably returns structured JSON; the few-shot JSON format in the prompt was a suggested pattern that works.
  • Catching that except Exception in the faithfulness grader was too broad and replacing it with (json.JSONDecodeError, anthropic.APIError).
  • Identifying that _build_index_by_domain was defined twice in pipeline.py (a duplicate introduced during an edit session), caught during code review.

Misled or required correction:

  • Initially used lru_cache on a function that takes a SentenceTransformer instance as an argument; the instance is unhashable, so the cache silently failed. Required switching to a module-level dict cache.
  • Generated a dead loop in rosetta.py: it iterated over terms with a continue but no code after the continue branch, so it did nothing. The intended logic existed only in a comment describing it and was never implemented. Caught in review.
  • Suggested a fictional client name that conflicted with a real company. Required renaming before the repo went public.
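
The lru_cache fix above follows a simple pattern, sketched here with names that are hypothetical, not the project's:

```python
# Module-level dict keyed on the text, sidestepping lru_cache's requirement
# that every argument (including the model instance) be hashable.
_EMBED_CACHE: dict[str, list[float]] = {}

def embed_cached(model, text: str) -> list[float]:
    """Encode text via the model, caching by text so repeated queries
    skip the model call entirely."""
    if text not in _EMBED_CACHE:
        _EMBED_CACHE[text] = model.encode(text)  # model call only on a miss
    return _EMBED_CACHE[text]
```

The cost is that the cache is unbounded; for a small KB and a bounded query vocabulary that is acceptable, but a long-running service might want an explicit eviction policy.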