# Design Notes

## Key decisions and tradeoffs

### API target: own implementation

Instead of wrapping a third-party fake API, the client wraps this project's own
FastAPI backend. This means the client and the API are co-designed: the typed
models on both sides stay in sync by design. The tradeoff: less realistic than
wrapping an external API you don't control, but the test surface is richer and
the integration tests verify real business logic, not just HTTP plumbing.

### Two-layer evaluation (L1 live / L2 batch)

L1 runs on every query inline (~1-2s overhead). L2 runs offline against a golden
dataset. The split is a deliberate latency/depth tradeoff: LLM-judged metrics
(contextual precision, reverse-question relevancy) add 30+ seconds per pair,
which is unacceptable live but fine in batch. The golden dataset is the contract;
L2 is the regression gate.

### Deterministic chain_terminology over LLM judge

The terminology check is a dict lookup, not a model call. Zero latency, zero cost,
zero false negatives on known mappings. The tradeoff: it only catches terms in the
catalog, so novel terminology drift goes undetected. An LLM judge would catch drift
but would introduce latency and non-determinism into a metric that must be auditable.
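
A minimal sketch of the shape of that check. The `TERM_CATALOG` name and the exact
signature of `check_terminology` are assumptions here; the real mapping lives in the
project's own module:

```python
# Illustrative only: TERM_CATALOG and this signature are assumptions.
TERM_CATALOG: dict[str, str] = {
    "item registry": "product catalog",  # rival term -> approved term
    "sku vault": "product catalog",
}

def check_terminology(response: str) -> list[str]:
    """Return rival terms found in a response. Pure string work:
    deterministic, zero cost, auditable."""
    lowered = response.lower()
    return [term for term in TERM_CATALOG if term in lowered]
```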

### In-memory retrieval over vector database

KB size is 8-9 docs per domain. Encoding them at startup and doing cosine search
at query time adds ~2ms retrieval overhead with no infrastructure dependency.
A vector DB (Chroma, pgvector) would add operational complexity with zero
retrieval quality gain at this scale.
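
A sketch of the startup-encode / query-time cosine pattern, assuming
`sentence-transformers`; the doc list and model name are placeholders rather than
the project's actual values:

```python
# Sketch only: KB_DOCS and the model choice are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

KB_DOCS = ["Doc one text ...", "Doc two text ..."]  # 8-9 docs per domain

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(KB_DOCS, normalize_embeddings=True)  # encoded once, at startup

def retrieve(query: str, k: int = 3) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since the vectors are unit-norm
    return [KB_DOCS[i] for i in np.argsort(scores)[::-1][:k]]
```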

### httpx + tenacity for the client

`httpx` is the modern alternative to `requests`: native async support if needed
later, a cleaner timeout API, and better type annotations. `tenacity` cleanly
separates retry policy from request logic: the retry decorator is readable and
testable independently of the HTTP code.
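
A sketch of that separation; the specific policy (attempt count, backoff, retried
exception types) is an assumption, not the project's actual config:

```python
# Sketch only: the retry policy below is illustrative.
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(httpx.TransportError),  # timeouts, connect errors
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=8),
)
def get_json(client: httpx.Client, url: str) -> dict:
    resp = client.get(url, timeout=5.0)
    resp.raise_for_status()
    return resp.json()
```

Because the policy lives entirely in the decorator, it can be exercised in a unit
test by stubbing a callable that raises `httpx.TransportError`, without any HTTP
traffic.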

### Integration tests are read-only by design

The API has no mutable state: queries don't persist, and no records are created or
deleted. Cleanup is therefore trivially satisfied; there is nothing to clean up.
This is called out explicitly because it's a deliberate architectural choice, not
an oversight. A stateful API (task creation, deletion) would require explicit
teardown fixtures.

---

## NLI model selection: what was tried and why

The faithfulness grader went through three models before converging.

**Vectara HHEM v2** (`vectara/hallucination_evaluation_model`): purpose-built for RAG
faithfulness rather than general NLI, and the correct model for this task. Unusable in
practice: the checkpoint is missing `t5.transformer.encoder.embed_tokens.weight`, so the
embedding matrix is zero-initialized (`std=0.0`) and the model produces a constant 0.502
probability for every input. Diagnosed via weight inspection, not an error message.

**`cross-encoder/nli-deberta-v3-small`** (first attempt, paragraph-level): 3-class NLI
(contradiction / entailment / neutral). Correct model family, wrong input format. NLI
cross-encoders are trained on sentence-pair inputs (SNLI/MNLI). Feeding a 3-4 sentence
KB paragraph as the premise causes entailment scores to collapse: verbatim text scores
`ent=0.002` and is treated as neutral. Root cause: the model distributes probability
mass across longer sequences in ways not seen during training.

**`cross-encoder/nli-deberta-v3-small` (sentence-level)**: same model, fixed by splitting
KB chunks into individual sentences before scoring. Verbatim text: `ent=0.995`. Aliased
terms ("item registry" vs "product catalog (item registry)"): `ent=0.989`. Hallucinated
facts: `ent≈0.000`, `contradiction≈1.0`. This is the current implementation.

**Key insight:** the NLI model selection problem is as much a data format problem as a
model selection problem. The same model produces correct results at sentence level and
degenerate results at paragraph level.
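
A sketch of the sentence-level scoring loop. The 3-class output order (contradiction,
entailment, neutral) follows the model card; the aggregation and names here are
illustrative, not the project's exact code:

```python
# Sketch only: the max-over-sentences aggregation is illustrative.
import numpy as np
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-small")

def max_entailment(premise_sentences: list[str], claim: str) -> float:
    """Max entailment probability of one claim against each KB sentence."""
    pairs = [(sent, claim) for sent in premise_sentences]
    logits = nli.predict(pairs)  # raw logits, shape (n_pairs, 3)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(probs[:, 1].max())  # index 1 = entailment per the model card
```

Scoring each claim against individual sentences rather than whole paragraphs is
exactly the fix described above: the premise stays within the sentence-pair regime
the model was trained on.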

---

## Alternative judge approaches considered

### Ollama (local LLM judge)

Ollama can run Llama 3 / Mistral locally, making it a zero-cost alternative to the
HF Inference API for both generation and LLM-as-judge evaluation. The tradeoff:
it requires a local GPU (or accepting slower CPU inference); in exchange there are
no external API rate limits, and outputs are reproducible because the model version
is pinned. For the faithfulness judge specifically, a local `llama3` via Ollama would
remove the dependency on the HF token entirely and allow offline eval runs.
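
If this route were taken, the judge would be a single HTTP call against Ollama's
local REST API (default port 11434). The prompt and parsing below are illustrative;
the project does not currently ship this:

```python
# Sketch only: prompt wording and output handling are illustrative.
import httpx

def ollama_judge(question: str, answer: str, context: str) -> str:
    prompt = (
        "Given the context, is the answer faithful to it? "
        "Reply with exactly FAITHFUL or UNFAITHFUL.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer: {answer}"
    )
    resp = httpx.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```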

### Prometheus (LLM eval framework)

[Prometheus-2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) is a
7B model fine-tuned specifically for evaluation tasks: it outputs a score plus a
rationale in a structured format designed for rubric-based grading. It's a drop-in
replacement for GPT-4/Claude as an eval judge, runs via Ollama or HF Inference, and
is purpose-built for the kind of faithfulness and relevancy scoring done in
`eval/metrics.py`. The tradeoff vs. the current sentence-level NLI approach:
Prometheus is slower (a 7B generative model vs. a small purpose-built cross-encoder)
but produces a human-readable rationale alongside the score, which is more
interpretable for audit and debugging.

**Why not used here:** the cross-encoder NLI approach runs faster and requires no
prompt engineering. Prometheus would be the right choice if rationale logging were a
compliance requirement.

---

## What another 4 hours would add

- **`eval/metrics.py` (L2 LLM metrics)**: contextual precision (chunk ranking),
  contextual recall (coverage), and answer correctness against full reference
  answers. Currently only keyphrase coverage is used as a proxy.
- **Async client**: an `httpx.AsyncClient` variant for high-concurrency load testing.
- **Property-based tests**: `hypothesis` to fuzz `check_terminology` and the graders
  with generated strings, catching edge cases the golden dataset doesn't cover (see
  the sketch after this list).
- **CI pipeline**: GitHub Actions running `make lint`, `make type-check`, and
  `make test` on every PR. Integration tests gated on a self-hosted runner with
  the API running.
- **Threshold calibration report**: `eval/calibrate.py` exists and runs the graders
  against golden-dataset expected answers, so threshold calibration is now a single
  command rather than a missing feature. Actual threshold adjustments still require
  reviewing the output against real query distributions.
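
For the property-based item above, a minimal sketch of what a `hypothesis` fuzz test
could look like, reusing the `check_terminology` / `TERM_CATALOG` shape from the
terminology sketch earlier; the invariants are assumptions about what the grader
should guarantee:

```python
# Sketch only: the invariants below are assumed, not taken from the test suite.
from hypothesis import given, strategies as st

@given(st.text())
def test_check_terminology_is_total(s: str) -> None:
    found = check_terminology(s)  # must never raise on arbitrary input
    assert all(term in TERM_CATALOG for term in found)  # only cataloged terms
```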

## Gate 5 audit gaps addressed

- **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
  enough information" responses and returns score=1.0; a refusal makes no factual
  claims and is trivially faithful.
- **Partial grounding blind spot**: faithfulness now uses claim-level decomposition
  (`grade_faithfulness_decomposed`). The response is split into sentences and each is
  verified independently; score = supported_claims / total_claims. A response with
  one hallucinated sentence in three now scores 0.667, not 1.0.
- **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL`
  WARNING log entry and sets `flagged: true` in the response payload. The UI shows a
  red banner.
- **Cold-start latency**: the embedder and NLI model are pre-warmed at startup in the
  FastAPI lifespan.
- **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query,
  rival-term prompt injection, multi-doc synthesis, hallucination bait).
- **No drift detection**: added `eval/drift.py`, a KS two-sample test per metric that
  compares live telemetry scores against the golden-dataset baseline (sketched
  below). It detects faithfulness degradation at p < 0.05 once roughly 40% of traffic
  has degraded, given 40+ live events.
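
A sketch of the per-metric KS comparison, assuming `scipy`; the function name, alpha,
and minimum-event gate are illustrative of `eval/drift.py`, not copied from it:

```python
# Sketch only: names and thresholds are illustrative.
from scipy.stats import ks_2samp

def drifted(live_scores: list[float], baseline_scores: list[float],
            alpha: float = 0.05, min_events: int = 40) -> bool:
    """Flag drift when live metric scores diverge from the golden baseline."""
    if len(live_scores) < min_events:
        return False  # not enough live telemetry to test yet
    _stat, p_value = ks_2samp(live_scores, baseline_scores)
    return p_value < alpha
```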

---

## Where LLM assistance helped and where it misled

**Helped:**

- Scaffolding the full project structure (backend, client, tests, config) in a
  single session without losing consistency across files.
- Writing the faithfulness prompt in a way that reliably returns structured JSON;
  the few-shot JSON format in the prompt was a suggested pattern that works.
- Catching that `except Exception` in the faithfulness grader was too broad and
  replacing it with `(json.JSONDecodeError, anthropic.APIError)`.
- Identifying that `_build_index_by_domain` was defined twice in pipeline.py
  (a duplicate introduced during an edit session), caught during code review.

**Misled or required correction:**

- Initially used `lru_cache` on a function that takes a `SentenceTransformer`
  instance as an argument; the instance is unhashable, so the cache silently failed.
  Required switching to a module-level dict cache (see the sketch after this list).
- Generated a dead loop in `rosetta.py` that iterated over terms and hit `continue`
  with no code after the `continue` branch, so it did nothing. The logic existed in
  a comment describing intent but was never implemented. Caught in review.
- Suggested a fictional client name that conflicted with a real company. Required
  renaming before the repo went public.
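
For the cache item above, a minimal sketch of the module-level dict fix, with
illustrative names; keying on the text alone keeps the model instance out of the
cache key entirely:

```python
# Sketch only: names are illustrative.
import numpy as np

_EMBED_CACHE: dict[str, np.ndarray] = {}

def embed_cached(model, text: str) -> np.ndarray:
    """Cache keyed on the text only, so the model never enters the key."""
    if text not in _EMBED_CACHE:
        _EMBED_CACHE[text] = model.encode([text])[0]
    return _EMBED_CACHE[text]
```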