Spaces: below-threshold/ai-response-validator

mbochniak01 Claude Sonnet 4.6 committed 6 days ago

Commit

ffbf46f

1 Parent(s): e181667

Replace HHEM with sentence-level NLI, add claim decomposition and drift detection

Faithfulness grader:
- Replace broken Vectara HHEM v2 (zero embedding matrix) with cross-encoder/nli-deberta-v3-small
- Add decompose_claims() — splits response into atomic sentences for per-claim verification
- Add _context_sentences() — splits KB chunks into sentences before NLI scoring; fixes
paragraph-level entailment collapse (verbatim text was scoring ent=0.002 at paragraph level,
ent=0.995 at sentence level including aliased terms like "item registry" vs "product catalog")
- grade_faithfulness_decomposed() promoted to default in grade(); score = supported/total claims

Drift detection:
- eval/drift.py: KS two-sample test per metric vs golden-dataset baseline
- eval/compare_faithfulness.py: side-by-side whole-response vs claim-level scores
- eval/simulate_traffic.py: clean + hallucinated traffic simulation for drift testing
- tests/unit/test_drift.py: 12 unit tests for detect_drift()

Docs updated to reflect all changes (ARCHITECTURE.md, NOTES.md, README.md)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (9)

ARCHITECTURE.md +69 -32
NOTES.md +38 -6
README.md +15 -4
backend/grader.py +84 -42
eval/compare_faithfulness.py +132 -0
eval/drift.py +200 -0
eval/simulate_traffic.py +190 -0
tests/unit/test_drift.py +116 -0
tests/unit/test_grader.py +115 -0

ARCHITECTURE.md CHANGED

@@ -60,7 +60,7 @@ Runs inline with every request. No ground truth required.
 | `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate — fails hard |
 | `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
 | `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
-| `faithfulness` | Vectara HHEM v2 cross-encoder | ≥ 0.35 | Hallucination detection |
 | `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
 ### L2 — Batch (local, against golden dataset)
@@ -68,10 +68,14 @@ Runs inline with every request. No ground truth required.
 ```bash
 python eval/metrics.py --domain retail
 python eval/metrics.py --client novamart --out results.json
 ```
-Runs all 20 golden pairs through the full pipeline. Adds keyphrase coverage scoring
-on top of L1 metrics to verify factual completeness against reference answers.
 ---
@@ -85,25 +89,25 @@ Two fundamentally different model architectures serve different roles in this sy
 |---|---|---|
 | **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
 | **Speed** | Fast — embeddings pre-computed at index build time | Slow — must re-encode every (query, doc) pair at inference |
-| **Quality** | Good for retrieval: finds semantically similar docs | Better for re-ranking or NLI: captures fine-grained entailment |
-| **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (Vectara HHEM v2) |
 **Measured overhead (CPU, HF Spaces):**
 | Step | Model | Typical latency |
 |------|-------|----------------|
 | Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10–15 ms |
-| KB cosine search (1,346 docs) | numpy matrix multiply | ~2 ms |
 | Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
-| Faithfulness (3 chunk pairs) | cross-encoder (Vectara HHEM v2) | ~300–600 ms |
-| Total grading overhead | — | ~350–650 ms |
 **Why bi-encoder for retrieval:** query time is constant regardless of KB size because
 document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
 query latency — only index build time grows.
 **Why cross-encoder for faithfulness:** cross-encoders see both the document and the
-response simultaneously, capturing entailment relationships bi-encoders miss. A response
 can be semantically similar to a document (high cosine) while still hallucinating specific
 facts — the cross-encoder catches this, the bi-encoder does not.
@@ -124,22 +128,23 @@ It flags rival-client terms appearing without the correct client term.
 **Why this matters:** in production multi-tenant AI systems, terminology leakage
 between clients is a real failure mode. This catches it mechanically.
-### Faithfulness via Vectara HHEM v2
-The faithfulness grader uses [Vectara's Hallucination Evaluation Model](https://huggingface.co/vectara/hallucination_evaluation_model) —
-a cross-encoder fine-tuned specifically for RAG faithfulness (not general NLI entailment).
-It scores `(document_chunk, response)` pairs and returns a probability in [0, 1] that
-the response is factually consistent with the document.
-**Why not Claude-as-judge:** adds API cost and latency per query; non-deterministic;
-requires prompt engineering to produce consistent scores. A purpose-built cross-encoder
-is faster, cheaper, and more consistent for this specific task.
-**Why not generic NLI (DeBERTa):** general NLI models are trained on textual entailment
-benchmarks, not RAG faithfulness. They score whether a hypothesis follows logically from
-a premise — a different task. Correct, grounded answers score near zero on NLI entailment,
-causing false positives. HHEM v2 is trained on (document, response) pairs from real RAG
-systems, which maps directly to this use case.
 ### In-memory semantic retrieval
@@ -193,8 +198,12 @@ knowledge/
     features.yaml       KB documents for retrieval
 eval/
-  golden-dataset.yaml   20 Q&A pairs (10 retail, 10 pharma) for L2 evaluation
-  metrics.py            L2 batch runner — CLI, keyphrase scoring, HTML report
 ui/
   index.html    Chat interface + eval panel
@@ -207,7 +216,10 @@ ui/
 | Decision | Alternative | Why this |
 |----------|-------------|----------|
-| Vectara HHEM v2 for faithfulness | Claude-as-judge / DeBERTa NLI | Purpose-built for RAG faithfulness; no API cost; deterministic |
 | In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
 | Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
 | Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
@@ -219,10 +231,35 @@ ui/
 ## Evaluation coverage vs RAGAS
-| RAGAS metric | Coverage |
-|---|---|
-| faithfulness | ✓ L1 (Claude judge) |
-| answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) |
-| context_precision | partial — retrieval score visible in UI |
-| context_recall | ✓ L2 (keyphrase coverage) |
-| answer_correctness | ✓ L2 (keyphrase + expected_answer) |

 | `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate — fails hard |
 | `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
 | `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
+| `faithfulness` | Claim decomposition + sentence-level NLI | ≥ 0.35 (proportion) | Hallucination detection, claim-level granularity |
 | `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
 ### L2 — Batch (local, against golden dataset)
 ```bash
 python eval/metrics.py --domain retail
 python eval/metrics.py --client novamart --out results.json
+python eval/calibrate.py          # threshold distribution on golden answers
+python eval/compare_faithfulness.py  # whole-response vs claim-level side-by-side
+python eval/drift.py              # KS drift detection vs golden baseline
 ```
+Runs golden pairs through the full pipeline. Adds keyphrase coverage scoring on top of
+L1 metrics to verify factual completeness against reference answers. `drift.py` compares
+live telemetry score distributions against the golden baseline using KS two-sample tests.
 ---
 |---|---|---|
 | **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
 | **Speed** | Fast — embeddings pre-computed at index build time | Slow — must re-encode every (query, doc) pair at inference |
+| **Quality** | Good for retrieval: finds semantically similar docs | Better for NLI: captures fine-grained entailment between short sequences |
+| **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (`nli-deberta-v3-small`) |
 **Measured overhead (CPU, HF Spaces):**
 | Step | Model | Typical latency |
 |------|-------|----------------|
 | Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10–15 ms |
+| KB cosine search | numpy matrix multiply | ~2 ms |
 | Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
+| Faithfulness (N claim × M sentence pairs) | cross-encoder NLI | ~200–500 ms |
+| Total grading overhead | — | ~250–550 ms |
 **Why bi-encoder for retrieval:** query time is constant regardless of KB size because
 document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
 query latency — only index build time grows.
 **Why cross-encoder for faithfulness:** cross-encoders see both the document and the
+claim simultaneously, capturing entailment relationships bi-encoders miss. A response
 can be semantically similar to a document (high cosine) while still hallucinating specific
 facts — the cross-encoder catches this, the bi-encoder does not.
 **Why this matters:** in production multi-tenant AI systems, terminology leakage
 between clients is a real failure mode. This catches it mechanically.
+### Faithfulness: claim decomposition + sentence-level NLI
+The faithfulness grader (`grade_faithfulness_decomposed`) uses a three-step pipeline:
+1. **Claim decomposition** — response split into individual sentences via regex. Each sentence is an atomic claim to verify independently.
+2. **Context sentence splitting** — KB chunks split into individual sentences before scoring. NLI cross-encoders are calibrated on sentence-pair inputs (SNLI/MNLI training format); paragraph-level inputs degrade performance significantly (verbatim text scores near 0.002 entailment when the context is a 3–4 sentence paragraph).
+3. **Per-claim NLI scoring** — each claim scored against every context sentence. Claim is "supported" if max entailment ≥ threshold. Score = supported_claims / total_claims.
+**Model:** `cross-encoder/nli-deberta-v3-small` — 3-class NLI (contradiction / entailment / neutral). Entailment column used. Sentence-level inputs give `ent ≥ 0.98` for verbatim and aliased claims ("item registry" vs "product catalog (item registry)").
+**Why claim-level not whole-response NLI:** whole-response NLI misses partial hallucinations. A 4-sentence response with 3 correct sentences and 1 fabricated one scores high because the model finds one well-grounded sentence. Claim-level scores 3/4 = 0.75 and exposes the fabrication in metadata.
+**Why sentence-level context not paragraph-level:** NLI cross-encoders are trained on single (premise, hypothesis) sentence pairs. Feeding a paragraph as premise causes entailment scores to collapse — the model distributes probability mass across longer sequences in ways not seen during training. Sentence-level splitting resolves alias mismatches too: `"pricing sync"` vs `"Price updates (pricing syncs) must be submitted..."` scores `ent=0.986` at sentence level.
+**Why not Claude-as-judge for L1:** adds API cost and latency per query; non-deterministic. The cross-encoder handles L1; LLM-as-judge belongs in L2 batch evaluation for authoritative ground-truth comparison.
+**Why not Vectara HHEM v2:** HHEM v2 checkpoint is missing `t5.transformer.encoder.embed_tokens.weight` — the embedding matrix is zero-initialized, producing a constant 0.502 probability for every input regardless of content. Diagnosed via `embed_tokens.std() == 0.0`.
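
The claim-level arithmetic described above (3 supported claims out of 4 gives 0.75) can be sketched without loading the NLI model. This is a minimal illustration, not the committed implementation: `claim_level_score` and `substring_entail` are hypothetical names, and the substring check is only a stand-in for the cross-encoder's entailment probability. The default threshold of 0.35 mirrors the value listed in the L1 metrics table.

```python
from typing import Callable

# Hypothetical stand-in for the NLI cross-encoder: returns 1.0 when the claim
# appears verbatim (case-insensitive) in the context sentence, else 0.0.
# The real grader uses an entailment probability instead.
def substring_entail(premise: str, hypothesis: str) -> float:
    return 1.0 if hypothesis.lower().rstrip(".") in premise.lower() else 0.0


def claim_level_score(
    claims: list[str],
    context_sentences: list[str],
    entail: Callable[[str, str], float],
    threshold: float = 0.35,
) -> tuple[float, list[dict]]:
    """Score = supported claims / total claims; a claim is supported when its
    best entailment score over all context sentences reaches the threshold."""
    if not claims:
        return 0.0, []
    results = []
    for claim in claims:
        # Best score across every context sentence (0.0 if there is no context).
        best = max((entail(s, claim) for s in context_sentences), default=0.0)
        results.append({"claim": claim, "score": best, "supported": best >= threshold})
    supported = sum(1 for r in results if r["supported"])
    return supported / len(results), results


context = [
    "Price updates must be submitted before noon.",
    "The product catalog is refreshed nightly.",
    "Returns are accepted within 30 days.",
]
claims = [
    "Price updates must be submitted before noon.",
    "The product catalog is refreshed nightly.",
    "Returns are accepted within 30 days.",
    "Refunds are paid in store credit only.",  # fabricated: not in context
]
score, per_claim = claim_level_score(claims, context, substring_entail)
print(score)  # 0.75
```

The per-claim list plays the role of the `metadata={"claims": ...}` field in the grader: the aggregate score still passes the threshold, but the unsupported claim is visible for inspection.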
 ### In-memory semantic retrieval
     features.yaml       KB documents for retrieval
 eval/
+  golden-dataset.yaml      24 Q&A pairs (20 standard + 4 adversarial edge cases)
+  metrics.py               L2 batch runner — CLI, keyphrase scoring, HTML report
+  calibrate.py             Threshold calibration — score distributions on golden answers
+  compare_faithfulness.py  Side-by-side: whole-response vs claim-level faithfulness scores
+  drift.py                 KS drift detection — live telemetry vs golden baseline
+  simulate_traffic.py      Populate telemetry with clean + hallucinated traffic for drift testing
 ui/
   index.html    Chat interface + eval panel
 | Decision | Alternative | Why this |
 |----------|-------------|----------|
+| `nli-deberta-v3-small` + sentence splitting | Vectara HHEM v2 / Claude-as-judge | HHEM broken (zero embeddings); DeBERTa works at sentence level; no API cost |
+| Claim-level faithfulness (proportion) | Whole-response NLI | Whole-response misses partial hallucinations; claim-level exposes them in metadata |
+| Sentence-level context splitting | Full paragraph as NLI premise | NLI models calibrated on sentence pairs; paragraph inputs collapse entailment scores |
+| KS two-sample test for drift | Evidently DataDriftPreset | Same statistical test, no extra dependency (scipy via scikit-learn) |
 | In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
 | Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
 | Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
 ## Evaluation coverage vs RAGAS
+| RAGAS metric | Coverage | Notes |
+|---|---|---|
+| faithfulness | ✓ L1 (claim-level NLI) | `grade_faithfulness_decomposed()` — sentence-level cross-encoder |
+| answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) | Bi-encoder cosine; LLM-based in L2 |
+| context_precision | partial — retrieval score in UI | No rank-weighted precision@k |
+| context_recall | ✓ L2 (keyphrase coverage) | Keyphrases as proxy for claim coverage |
+| answer_correctness | ✓ L2 (keyphrase + expected_answer) | |
+## Drift detection
+`eval/drift.py` detects distribution shift in grader scores between live traffic and the golden-dataset baseline.
+```
+reference = build_reference()   # run all graders on golden-dataset expected_answers
+current   = build_current()     # pull metric scores from telemetry._events
+results = detect_drift(current, reference, alpha=0.05)
+# → per-metric: ks_statistic, p_value, drifted, ref_mean, cur_mean, delta
+```
+**Statistical test:** KS two-sample (Kolmogorov–Smirnov). Same test as Evidently `DataDriftPreset` for numerical columns. Detects any shift in distribution shape, not just mean change.
+**Sensitivity:** with n_ref=24 golden pairs, KS test reaches p < 0.05 at ~40% traffic degradation (n_cur=40+). Smaller effects require larger current sample windows.
+**What each metric's drift signals:**
+| Metric | Drift means |
+|--------|-------------|
+| `faithfulness` | Model hallucinating more / KB stale / retrieval returning wrong docs |
+| `answer_relevancy` | Query distribution shifted / model off-topic |
+| `chain_terminology` | Terminology catalog misaligned with model outputs |
+| `pii_leakage` / `token_budget` | Structural output format changed |
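
For readers who want to see what the per-metric KS comparison involves, here is a rough standard-library sketch. It is not the code in `eval/drift.py`: the function names are hypothetical, the p-value uses the common asymptotic approximation rather than an exact computation, and in practice `scipy.stats.ks_2samp` is the usual way to obtain both numbers. The output keys follow the fields named in the snippet above.

```python
import math
from statistics import fmean


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Maximum absolute gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d: float, n: int, m: int) -> float:
    """Asymptotic two-sided p-value (approximation; assumption of this sketch)."""
    if d == 0.0:
        return 1.0
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(max(total, 0.0), 1.0)


def detect_drift_sketch(current: dict, reference: dict, alpha: float = 0.05) -> dict:
    """Compare each metric present in both dicts; flag drift when p < alpha."""
    report = {}
    for metric in reference.keys() & current.keys():
        ref, cur = reference[metric], current[metric]
        d = ks_statistic(ref, cur)
        p = ks_pvalue(d, len(ref), len(cur))
        report[metric] = {
            "ks_statistic": d,
            "p_value": p,
            "drifted": p < alpha,
            "ref_mean": fmean(ref),
            "cur_mean": fmean(cur),
            "delta": fmean(cur) - fmean(ref),
        }
    return report


# Illustrative scores only, not measurements from the project.
reference = {"faithfulness": [0.80 + 0.01 * i for i in range(20)]}
clean = {"faithfulness": [0.80 + 0.01 * i for i in range(20)]}
degraded = {"faithfulness": [0.10 + 0.01 * i for i in range(20)]}

print(detect_drift_sketch(clean, reference)["faithfulness"]["drifted"])     # False
print(detect_drift_sketch(degraded, reference)["faithfulness"]["drifted"])  # True
```

Because the statistic is the largest gap between empirical CDFs, it responds to changes in spread or shape as well as to a shifted mean, which matches the rationale given for choosing the KS test.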

NOTES.md CHANGED

@@ -43,6 +43,34 @@ teardown fixtures.
 ---
 ## Alternative judge approaches considered
 ### Ollama (local LLM judge)
@@ -59,12 +87,12 @@ dependency on HF token entirely and allow offline eval runs.
 in a structured format designed for rubric-based grading. It's a drop-in replacement
 for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
 for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
-The tradeoff vs. the current Vectara HHEM v2 approach: Prometheus is slower (7B vs
 purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
 which is more interpretable for audit and debugging.
-**Why not used here:** HHEM v2 runs faster and requires no prompt engineering.
-Prometheus would be the right choice if rationale logging is a compliance requirement.
 ---
@@ -88,14 +116,18 @@ Prometheus would be the right choice if rationale logging is a compliance requir
 - **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
   enough information" responses and returns score=1.0 — no factual claims, trivially faithful.
-- **Partial grounding blind spot**: faithfulness now uses sentence-level min-score (weakest
-  link wins) instead of max-score across chunks. A response with one hallucinated sentence
-  now fails even if other sentences are grounded.
 - **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING
   log entry and sets `flagged: true` in the response payload. UI shows a red banner.
 - **Cold-start latency**: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
 - **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query, rival-term
   prompt injection, multi-doc synthesis, hallucination bait).
 ---

 ---
+## NLI model selection — what was tried and why
+The faithfulness grader went through three models before converging:
+**Vectara HHEM v2** (`vectara/hallucination_evaluation_model`) — purpose-built for RAG
+faithfulness, not general NLI. The correct model for this task. Unusable: the checkpoint
+is missing `t5.transformer.encoder.embed_tokens.weight`. The embedding matrix is
+zero-initialized (`std=0.0`), producing constant 0.502 probability for every input.
+Diagnosed via weight inspection, not error message.
+**`cross-encoder/nli-deberta-v3-small`** (first attempt, paragraph-level) — 3-class NLI
+(contradiction / entailment / neutral). Correct model family, wrong input format.
+NLI cross-encoders are trained on sentence-pair inputs (SNLI/MNLI). Feeding a 3–4
+sentence KB paragraph as the premise causes entailment scores to collapse — verbatim
+text scores `ent=0.002`, treated as neutral. Root cause: model distributes probability
+across longer sequences in ways not seen during training.
+**`cross-encoder/nli-deberta-v3-small` (sentence-level)** — same model, fixed by splitting
+KB chunks into individual sentences before scoring. Verbatim: `ent=0.995`. Aliased terms
+("item registry" vs "product catalog (item registry)"): `ent=0.989`. Hallucinated facts:
+`ent≈0.000`, contradiction≈1.0. This is the current implementation.
+**Key insight:** the NLI model selection problem is a data format problem as much as a
+model selection problem. The same model produces correct results at sentence level and
+degenerate results at paragraph level.
+---
 ## Alternative judge approaches considered
 ### Ollama (local LLM judge)
 in a structured format designed for rubric-based grading. It's a drop-in replacement
 for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
 for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
+The tradeoff vs. the current sentence-level NLI approach: Prometheus is slower (7B vs
 purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
 which is more interpretable for audit and debugging.
+**Why not used here:** the cross-encoder NLI approach runs faster and requires no prompt
+engineering. Prometheus would be the right choice if rationale logging is a compliance requirement.
 ---
 - **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
   enough information" responses and returns score=1.0 — no factual claims, trivially faithful.
+- **Partial grounding blind spot**: faithfulness now uses claim-level decomposition
+  (`grade_faithfulness_decomposed`). Response split into sentences; each verified
+  independently. Score = supported_claims / total_claims. A response with one hallucinated
+  sentence in three now scores 0.667, not 1.0.
 - **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING
   log entry and sets `flagged: true` in the response payload. UI shows a red banner.
 - **Cold-start latency**: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
 - **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query, rival-term
   prompt injection, multi-doc synthesis, hallucination bait).
+- **No drift detection**: added `eval/drift.py` — KS two-sample test per metric, compares
+  live telemetry scores against golden-dataset baseline. Detects faithfulness degradation
+  at p < 0.05 with ~40% traffic degradation across 40+ events.
 ---

README.md CHANGED

@@ -57,13 +57,23 @@ All tests are stateless — no cleanup required.
 ## Batch evaluation (L2)
 ```bash
-make eval-retail      # evaluate 10 retail Q&A pairs, open HTML report
-make eval-pharma      # evaluate 10 pharma Q&A pairs, open HTML report
-make eval             # all 20 pairs
 ```
 Reports are written to `eval/reports/`.
 ---
 ## Code quality
@@ -133,8 +143,9 @@ See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency
 | PII Leakage | L1 live | Regex scan — binary |
 | Token Budget | L1 live | Char count ÷ 4 |
 | Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
-| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
 | Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
 | Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
 **Core principle:** no single metric proves correctness. The combination does.

 ## Batch evaluation (L2)
 ```bash
+make eval-retail      # evaluate retail Q&A pairs, open HTML report
+make eval-pharma      # evaluate pharma Q&A pairs, open HTML report
+make eval             # all domains
 ```
 Reports are written to `eval/reports/`.
+**Drift detection** (no server required):
+```bash
+python eval/simulate_traffic.py   # populate telemetry + run drift report
+python eval/drift.py              # drift report against live telemetry
+```
+Compares live grader score distributions against the golden-dataset baseline using KS tests.
+Detects faithfulness degradation from model updates, KB staleness, or query distribution shift.
 ---
 ## Code quality
 | PII Leakage | L1 live | Regex scan — binary |
 | Token Budget | L1 live | Char count ÷ 4 |
 | Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
+| Faithfulness | L1 live | Claim decomposition + sentence-level NLI cross-encoder |
 | Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
 | Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
+| Drift Detection | L2 offline | KS two-sample test vs golden-dataset baseline |
 **Core principle:** no single metric proves correctness. The combination does.

backend/grader.py CHANGED

@@ -5,7 +5,7 @@ Metrics:
   pii_leakage        — regex scan for PII patterns in response
   token_budget       — response within allowed token ceiling
   answer_relevancy   — cosine similarity between query and response embeddings
-  faithfulness       — Vectara HHEM v2: RAG faithfulness probability per (doc, response) pair
   chain_terminology  — deterministic: client-specific terms used (via RosettaStone)
 """
@@ -14,19 +14,20 @@ import re
 from dataclasses import dataclass, field
 from typing import Any
 from config import EMBEDDER_MODEL
 from rosetta import check_terminology
-from sentence_transformers import SentenceTransformer
 from sklearn.metrics.pairwise import cosine_similarity
-from transformers import T5Tokenizer
-from transformers import pipeline as hf_pipeline
 log = logging.getLogger(__name__)
 _embedder: SentenceTransformer | None = None
-_nli_model: Any = None
-NLI_MODEL = "vectara/hallucination_evaluation_model"
 def get_embedder() -> SentenceTransformer:
@@ -37,31 +38,11 @@ def get_embedder() -> SentenceTransformer:
     return _embedder
-def get_nli_model() -> Any:
-    """Return the shared Vectara faithfulness pipeline, loading it on first call."""
     global _nli_model
     if _nli_model is None:
-        # HHEMv2 doesn't call post_init() in __init__, so all_tied_weights_keys is never
-        # set — transformers 5.x requires it in _finalize_model_loading. Patch before load.
-        from transformers import PreTrainedModel
-        _orig = PreTrainedModel.mark_tied_weights_as_initialized
-        def _patched(self: Any, loading_info: Any) -> None:
-            if not hasattr(self, "all_tied_weights_keys"):
-                self.all_tied_weights_keys = {}
-            _orig(self, loading_info)  # type: ignore[no-untyped-call]
-        PreTrainedModel.mark_tied_weights_as_initialized = _patched  # type: ignore[method-assign]
-        tokenizer = T5Tokenizer.from_pretrained("t5-small")
-        _nli_model = hf_pipeline(
-            "text-classification",
-            model=NLI_MODEL,
-            tokenizer=tokenizer,
-            trust_remote_code=True,
-            truncation=True,
-            max_length=512,
-        )
-        PreTrainedModel.mark_tied_weights_as_initialized = _orig  # type: ignore[method-assign]
     return _nli_model
@@ -95,6 +76,8 @@ class GradeReport:
         }
 _PII_PATTERNS = [
     (r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
     (r"\b\d{16}\b", "credit card"),
@@ -163,8 +146,28 @@ def _strip_chunk_title(chunk: str) -> str:
     return chunk
 def grade_faithfulness(response: str, context: str) -> GradeResult:
-    """Faithfulness via Vectara hallucination model: scores (document, response) pairs directly."""
     if _is_refusal(response):
         return GradeResult(
             metric="faithfulness", passed=True, score=1.0,
@@ -175,24 +178,63 @@ def grade_faithfulness(response: str, context: str) -> GradeResult:
     if not raw_chunks:
         return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
     chunks = [_strip_chunk_title(c) for c in raw_chunks]
-    # text_pair encodes sequences with T5 </s> separator — correct for T5-based models.
-    pairs = [{"text": chunk, "text_pair": response} for chunk in chunks]
-    results = model(pairs)
-    log.info("Vectara raw: %s", [(r["label"], round(r["score"], 3)) for r in results])
-    scores = [
-        r["score"] if r["label"].lower().startswith("factually consistent") else 1.0 - r["score"]
-        for r in results
-    ]
-    score = float(max(scores))
-    passed = score >= FAITHFULNESS_THRESHOLD
     return GradeResult(
         metric="faithfulness",
-        passed=passed,
         score=score,
         detail=f"Faithfulness {score:.3f} (threshold: {FAITHFULNESS_THRESHOLD})",
     )
 def grade_chain_terminology(response: str, client: str) -> GradeResult:
     """Check that the response uses client-specific terms, not rival terminology."""
     result = check_terminology(response, client)
@@ -226,7 +268,7 @@ def grade(
         grade_pii_leakage(response),
         grade_token_budget(response, token_budget),
         grade_answer_relevancy(query, response),
-        grade_faithfulness(response, context),
         grade_chain_terminology(response, client),
     ]
     return report

   pii_leakage        — regex scan for PII patterns in response
   token_budget       — response within allowed token ceiling
   answer_relevancy   — cosine similarity between query and response embeddings
+  faithfulness       — NLI cross-encoder: entailment score per (chunk, claim) pair
   chain_terminology  — deterministic: client-specific terms used (via RosettaStone)
 """
 from dataclasses import dataclass, field
 from typing import Any
+import numpy as np
 from config import EMBEDDER_MODEL
 from rosetta import check_terminology
+from sentence_transformers import CrossEncoder, SentenceTransformer
 from sklearn.metrics.pairwise import cosine_similarity
 log = logging.getLogger(__name__)
 _embedder: SentenceTransformer | None = None
+_nli_model: CrossEncoder | None = None
+# cross-encoder/nli-deberta-v3-small: 3-class NLI, columns = [contradiction, entailment, neutral]
+NLI_MODEL = "cross-encoder/nli-deberta-v3-small"
+_NLI_ENTAILMENT_IDX = 1
 def get_embedder() -> SentenceTransformer:
     return _embedder
+def get_nli_model() -> CrossEncoder:
+    """Return the shared NLI cross-encoder, loading it on first call."""
     global _nli_model
     if _nli_model is None:
+        _nli_model = CrossEncoder(NLI_MODEL)
     return _nli_model
         }
+_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
 _PII_PATTERNS = [
     (r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
     (r"\b\d{16}\b", "credit card"),
     return chunk
+def decompose_claims(response: str) -> list[str]:
+    """Split response into atomic claim sentences (≥3 words each)."""
+    sentences = _SENTENCE_SPLIT.split(response.strip())
+    return [s.strip() for s in sentences if len(s.split()) >= 3]
+def _context_sentences(chunks: list[str]) -> list[str]:
+    """Flatten context chunks into individual sentences for sentence-level NLI scoring.
+    Cross-encoder NLI degrades on multi-sentence inputs — performance is calibrated
+    on single-sentence (premise, hypothesis) pairs matching the SNLI/MNLI training format.
+    """
+    sentences = []
+    for chunk in chunks:
+        for s in _SENTENCE_SPLIT.split(chunk.strip()):
+            if len(s.split()) >= 3:
+                sentences.append(s.strip())
+    return sentences
 def grade_faithfulness(response: str, context: str) -> GradeResult:
+    """Whole-response faithfulness: max entailment score across all context chunks."""
     if _is_refusal(response):
         return GradeResult(
             metric="faithfulness", passed=True, score=1.0,
     if not raw_chunks:
         return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
     chunks = [_strip_chunk_title(c) for c in raw_chunks]
+    sentences = _context_sentences(chunks)
+    pairs = [(s, response) for s in sentences]
+    scores_matrix: np.ndarray = model.predict(pairs, apply_softmax=True)
+    entailment: np.ndarray = scores_matrix[:, _NLI_ENTAILMENT_IDX]
+    log.info("NLI entailment scores: %s", [round(float(s), 3) for s in entailment])
+    score = float(entailment.max())
     return GradeResult(
         metric="faithfulness",
+        passed=score >= FAITHFULNESS_THRESHOLD,
         score=score,
         detail=f"Faithfulness {score:.3f} (threshold: {FAITHFULNESS_THRESHOLD})",
     )
+def grade_faithfulness_decomposed(response: str, context: str) -> GradeResult:
+    """Claim-level faithfulness: each sentence verified independently against context.
+    Supported claims / total claims — catches partial hallucinations missed by whole-response NLI.
+    """
+    if _is_refusal(response):
+        return GradeResult(
+            metric="faithfulness", passed=True, score=1.0,
+            detail="Refusal — no factual claims to verify",
+        )
+    raw_chunks = [c.strip() for c in context.split("\n\n") if c.strip()]
+    if not raw_chunks:
+        return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
+    chunks = [_strip_chunk_title(c) for c in raw_chunks]
+    claims = decompose_claims(response)
+    if not claims:
+        return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No claims extracted")
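+    sentences = _context_sentences(chunks)
+    # Guard: if every context sentence was filtered out (< 3 words), there are no NLI pairs to score.
+    if not sentences:
+        return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")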
+    model = get_nli_model()
+    claim_results: list[dict[str, Any]] = []
+    for claim in claims:
+        pairs = [(s, claim) for s in sentences]
+        scores_matrix: np.ndarray = model.predict(pairs, apply_softmax=True)
+        entailment: np.ndarray = scores_matrix[:, _NLI_ENTAILMENT_IDX]
+        best = float(entailment.max())
+        claim_results.append({"claim": claim, "score": round(best, 3), "supported": best >= FAITHFULNESS_THRESHOLD})
+    supported = sum(1 for c in claim_results if c["supported"])
+    score = supported / len(claim_results)
+    log.info("Claim decomposition: %d/%d supported (score=%.3f)", supported, len(claim_results), score)
+    return GradeResult(
+        metric="faithfulness",
+        passed=score >= FAITHFULNESS_THRESHOLD,
+        score=score,
+        detail=f"{supported}/{len(claim_results)} claims supported (threshold: {FAITHFULNESS_THRESHOLD})",
+        metadata={"claims": claim_results},
+    )
 def grade_chain_terminology(response: str, client: str) -> GradeResult:
     """Check that the response uses client-specific terms, not rival terminology."""
     result = check_terminology(response, client)
         grade_pii_leakage(response),
         grade_token_budget(response, token_budget),
         grade_answer_relevancy(query, response),
+        grade_faithfulness_decomposed(response, context),
         grade_chain_terminology(response, client),
     ]
     return report
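Reviewer note: the scoring rule in `grade_faithfulness_decomposed()` (best entailment per claim, then supported / total) can be exercised without loading the NLI model. The sketch below is an illustration only, not the grader's code: `claim_level_score`, `toy_entail`, and the `0.5` threshold are hypothetical, and the word-overlap scorer is a stand-in for the cross-encoder.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: (premise, hypothesis) -> entailment probability in [0, 1].
EntailFn = Callable[[str, str], float]


def claim_level_score(
    claims: Sequence[str],
    context_sentences: Sequence[str],
    entail: EntailFn,
    threshold: float = 0.5,  # assumed placeholder, not the project's FAITHFULNESS_THRESHOLD
) -> tuple[float, list[dict]]:
    """Return (supported / total, per-claim details)."""
    if not claims or not context_sentences:
        return 0.0, []
    details = []
    for claim in claims:
        # A claim counts as supported if at least one context sentence entails it.
        best = max(entail(sentence, claim) for sentence in context_sentences)
        details.append({"claim": claim, "score": round(best, 3), "supported": best >= threshold})
    supported = sum(1 for d in details if d["supported"])
    return supported / len(details), details


def toy_entail(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: fraction of hypothesis words found in the premise."""
    premise_words = set(premise.lower().rstrip(".").split())
    hypothesis_words = hypothesis.lower().rstrip(".").split()
    if not hypothesis_words:
        return 0.0
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)
```

In the diff above, the same constant gates both the per-claim entailment score and the final supported / total ratio. Those are different quantities, so a separate pass-ratio setting may be worth considering.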

eval/compare_faithfulness.py ADDED

	@@ -0,0 +1,132 @@

+"""
+Side-by-side comparison: whole-response faithfulness vs claim-level decomposition.
+Each golden-dataset pair is run through both graders using the full domain KB as context
+(simulates retrieval returning all docs — maximum pressure on the NLI signal).
+Output: aligned table with per-pair scores + delta, plus summary distributions.
+Usage:
+    cd /Users/praca/ai-response-validator && .venv/bin/python eval/compare_faithfulness.py
+"""
+import statistics
+import sys
+from pathlib import Path
+import yaml
+sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
+from grader import (
+    FAITHFULNESS_THRESHOLD,
+    grade_faithfulness,
+    grade_faithfulness_decomposed,
+)
+DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
+KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
+def _load_pairs() -> list[dict]:
+    return yaml.safe_load(DATASET_PATH.read_text())["pairs"]
+def _load_kb_context(domain: str) -> str:
+    path = KNOWLEDGE_ROOT / domain / "features.yaml"
+    data = yaml.safe_load(path.read_text())
+    chunks = [
+        f"[{doc['title']}]\n{doc['content'].strip()}"
+        for doc in data["documents"]
+    ]
+    return "\n\n".join(chunks)
+def _fmt(score: float | None) -> str:
+    return f"{score:.3f}" if score is not None else "  —  "
+def run() -> None:
+    pairs = _load_pairs()
+    kb: dict[str, str] = {}
+    print(f"\nFaithfulness comparison — {len(pairs)} golden-dataset pairs")
+    print("Context: full domain KB (all docs, simulating broad retrieval)\n")
+    header = f"{'id':<20}  {'whole':>7}  {'decomp':>7}  {'delta':>7}  {'claims':>6}  {'sup/tot':>7}  note"
+    print(header)
+    print("-" * len(header))
+    whole_scores: list[float] = []
+    decomp_scores: list[float] = []
+    deltas: list[float] = []
+    refusals: list[str] = []
+    for pair in pairs:
+        pid = pair["id"]
+        domain = pair["domain"]
+        response = pair["expected_answer"].strip()
+        if domain not in kb:
+            kb[domain] = _load_kb_context(domain)
+        context = kb[domain]
+        w = grade_faithfulness(response, context)
+        d = grade_faithfulness_decomposed(response, context)
+        if "Refusal" in w.detail:
+            refusals.append(pid)
+            print(f"{pid:<20}  {'REFUSAL':>7}  {'REFUSAL':>7}  {'':>7}  {'':>6}  {'':>7}")
+            continue
+        whole_scores.append(w.score)
+        decomp_scores.append(d.score)
+        delta = d.score - w.score
+        deltas.append(delta)
+        claims_meta = d.metadata.get("claims", [])
+        n_claims = len(claims_meta)
+        n_supported = sum(1 for c in claims_meta if c["supported"])
+        sup_tot = f"{n_supported}/{n_claims}"
+        note = ""
+        if abs(delta) >= 0.15:
+            note = "<-- gap"
+        # The "+" format flag prints an explicit sign and keeps the column 7 characters wide.
+        print(
+            f"{pid:<20}  {w.score:>7.3f}  {d.score:>7.3f}  {delta:>+7.3f}  {n_claims:>6}  {sup_tot:>7}  {note}"
+        )
+    print("-" * len(header))
+    print()
+    if whole_scores:
+        print("Score distributions (refusals excluded):\n")
+        for name, scores in [("whole_response", whole_scores), ("decomposed", decomp_scores)]:
+            below = sum(1 for s in scores if s < FAITHFULNESS_THRESHOLD)
+            print(
+                f"  {name:<16}  "
+                f"min={min(scores):.3f}  "
+                f"p25={sorted(scores)[len(scores)//4]:.3f}  "
+                f"median={statistics.median(scores):.3f}  "
+                f"p75={sorted(scores)[3*len(scores)//4]:.3f}  "
+                f"max={max(scores):.3f}  "
+                f"below_threshold={below}/{len(scores)}"
+            )
+        print()
+        neg_delta = sum(1 for d in deltas if d < -0.05)
+        mean_abs = statistics.mean(abs(d) for d in deltas)
+        print(f"  mean |delta|    : {mean_abs:.3f}")
+        print(f"  decomp < whole  : {neg_delta}/{len(deltas)} pairs  (whole-response was optimistic here)")
+        print(f"  threshold       : {FAITHFULNESS_THRESHOLD}")
+    if refusals:
+        print(f"\n  Refusals (auto-pass, excluded from stats): {', '.join(refusals)}")
+    print()
+if __name__ == "__main__":
+    run()
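Reviewer note: the summary block computes p25 and p75 by indexing into the sorted list, which is a nearest-rank approximation. A possible alternative, assuming interpolated quartiles are acceptable here, is `statistics.quantiles` from the standard library. The helper name `summarize` is hypothetical.

```python
import statistics


def summarize(scores: list[float], threshold: float) -> dict[str, float]:
    """Five-number summary plus the count of scores below `threshold`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if len(scores) >= 2:
        # method="inclusive" treats the data as the whole population,
        # so the quartiles never fall outside [min, max].
        p25, median, p75 = statistics.quantiles(scores, n=4, method="inclusive")
    else:
        # Fall back for a single data point, which older Python versions reject.
        p25 = median = p75 = scores[0]
    return {
        "min": min(scores),
        "p25": p25,
        "median": median,
        "p75": p75,
        "max": max(scores),
        "below_threshold": sum(1 for s in scores if s < threshold),
    }
```

With only a handful of golden-dataset pairs the two approaches can differ noticeably, so whichever is used should be stated in the output header.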

eval/drift.py ADDED

	@@ -0,0 +1,200 @@

+"""
+Drift detection: compare live grader score distributions against the golden-dataset baseline.
+Answers: has answer quality shifted since the reference was established?
+Catches: model updates, KB staleness, query distribution shift, threshold miscalibration.
+Statistical test: KS two-sample (the test Evidently's DataDriftPreset applies by default to small numerical samples).
+  - H0: current and reference are drawn from the same distribution
+  - H1: distributions differ
+  - Drifted if p_value < alpha (default 0.05)
+Reference: golden-dataset expected_answer scores (known-good baseline).
+Current:   in-memory telemetry._events from the running API session.
+Usage:
+    cd /Users/praca/ai-response-validator && .venv/bin/python eval/drift.py
+"""
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+import yaml
+from scipy.stats import ks_2samp
+sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
+from grader import (
+    grade_answer_relevancy,
+    grade_chain_terminology,
+    grade_faithfulness_decomposed,
+    grade_pii_leakage,
+    grade_token_budget,
+)
+DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
+KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
+METRICS = ["faithfulness", "answer_relevancy", "pii_leakage", "token_budget", "chain_terminology"]
+ALPHA = 0.05
+MIN_CURRENT_SAMPLES = 5
+@dataclass(slots=True)
+class MetricDrift:
+    metric: str
+    ks_statistic: float
+    p_value: float
+    drifted: bool
+    ref_mean: float
+    cur_mean: float
+    ref_n: int
+    cur_n: int
+def _load_kb_context(domain: str) -> str:
+    path = KNOWLEDGE_ROOT / domain / "features.yaml"
+    data = yaml.safe_load(path.read_text())
+    chunks = [f"[{doc['title']}]\n{doc['content'].strip()}" for doc in data["documents"]]
+    return "\n\n".join(chunks)
+Scores = dict[str, list[float]]
+def build_reference() -> Scores:
+    """Score every golden-dataset pair with all graders."""
+    pairs = yaml.safe_load(DATASET_PATH.read_text())["pairs"]
+    kb: dict[str, str] = {}
+    scores: Scores = {m: [] for m in METRICS}
+    for pair in pairs:
+        response = pair["expected_answer"].strip()
+        domain = pair["domain"]
+        if domain not in kb:
+            kb[domain] = _load_kb_context(domain)
+        scores["pii_leakage"].append(grade_pii_leakage(response).score)
+        scores["token_budget"].append(grade_token_budget(response).score)
+        scores["answer_relevancy"].append(grade_answer_relevancy(pair["question"], response).score)
+        scores["faithfulness"].append(grade_faithfulness_decomposed(response, kb[domain]).score)
+        scores["chain_terminology"].append(grade_chain_terminology(response, pair["client"]).score)
+    return scores
+def build_current() -> Scores:
+    """Pull metric scores from the in-memory telemetry buffer."""
+    import telemetry
+    with telemetry._lock:
+        events = list(telemetry._events)
+    scores: Scores = {m: [] for m in METRICS}
+    for event in events:
+        if "metrics" not in event:
+            continue
+        if any(event["metrics"].get(m) is None for m in METRICS):
+            continue
+        for m in METRICS:
+            scores[m].append(float(event["metrics"][m]))
+    return scores
+def detect_drift(
+    current: Scores,
+    reference: Scores,
+    alpha: float = ALPHA,
+) -> list[MetricDrift]:
+    """Run KS two-sample test per metric.
+    Skips a metric when it has fewer than MIN_CURRENT_SAMPLES current samples or an empty reference.
+    """
+    import numpy as np  # hoisted out of the loop; imported once per call
+    results: list[MetricDrift] = []
+    for metric in METRICS:
+        ref_col = reference.get(metric, [])
+        cur_col = current.get(metric, [])
+        if len(cur_col) < MIN_CURRENT_SAMPLES or len(ref_col) == 0:
+            continue
+        ref_arr = np.array(ref_col, dtype=float)
+        cur_arr = np.array(cur_col, dtype=float)
+        stat, pval = ks_2samp(ref_arr, cur_arr)
+        results.append(MetricDrift(
+            metric=metric,
+            ks_statistic=round(float(stat), 4),
+            p_value=round(float(pval), 4),
+            drifted=bool(pval < alpha),
+            ref_mean=round(float(ref_arr.mean()), 4),
+            cur_mean=round(float(cur_arr.mean()), 4),
+            ref_n=len(ref_arr),
+            cur_n=len(cur_arr),
+        ))
+    return results
+def report_drift(results: list[MetricDrift], alpha: float = ALPHA) -> None:
+    header = (
+        f"{'metric':<22}  {'ks_stat':>7}  {'p_value':>7}  {'status':>10}"
+        f"  {'ref_mean':>8}  {'cur_mean':>8}  {'delta':>7}"
+    )
+    print(header)
+    print("-" * len(header))
+    for r in results:
+        status = "DRIFT <--" if r.drifted else "ok"
+        delta = r.cur_mean - r.ref_mean
+        sign = "+" if delta >= 0 else ""
+        print(
+            f"{r.metric:<22}  {r.ks_statistic:>7.4f}  {r.p_value:>7.4f}  {status:>10}"
+            f"  {r.ref_mean:>8.4f}  {r.cur_mean:>8.4f}  {sign}{delta:>6.4f}"
+        )
+    drifted = [r for r in results if r.drifted]
+    print(f"\nOverall: {len(drifted)}/{len(results)} metrics drifted (alpha={alpha})")
+    if drifted:
+        print("\nDrifted metrics:")
+        for r in drifted:
+            direction = "degraded" if r.cur_mean < r.ref_mean else "improved"
+            print(f"  {r.metric}: {direction} ({r.ref_mean:.3f} → {r.cur_mean:.3f})")
+def run() -> None:
+    print("\nBuilding reference distribution from golden-dataset.yaml...")
+    reference = build_reference()
+    ref_n = len(next(iter(reference.values()), []))
+    print(f"Reference: {ref_n} pairs\n")
+    current = build_current()
+    cur_n = len(next(iter(current.values()), []))
+    if cur_n < MIN_CURRENT_SAMPLES:
+        import numpy as np
+        print(
+            f"Current: {cur_n} telemetry event(s) — need ≥{MIN_CURRENT_SAMPLES} to run KS test.\n"
+            f"Start the API and run some queries, then re-run drift.py.\n\n"
+            f"Reference distribution (golden baseline):\n"
+        )
+        for m in METRICS:
+            vals = np.array(reference[m])
+            print(f"  {m:<22}  mean={vals.mean():.3f}  std={vals.std():.3f}  min={vals.min():.3f}  max={vals.max():.3f}")
+        return
+    print(f"Current: {cur_n} telemetry events\n")
+    results = detect_drift(current, reference)
+    if not results:
+        print("No metrics had enough data for KS test.\n")
+        return
+    report_drift(results)
+    print()
+if __name__ == "__main__":
+    run()
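Reviewer note: for readers unfamiliar with the statistic that `ks_2samp` reports, it is the largest vertical gap between the two empirical CDFs. The sketch below computes only that statistic in pure Python. It is not a replacement for `scipy.stats.ks_2samp`, which also returns the p-value used for the `drifted` flag, and `ks_statistic` is a hypothetical name.

```python
from bisect import bisect_right


def ks_statistic(ref: list[float], cur: list[float]) -> float:
    """Two-sample KS statistic: max |ECDF_ref(x) - ECDF_cur(x)| over all observed x."""
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    ref_sorted, cur_sorted = sorted(ref), sorted(cur)
    d = 0.0
    # The ECDFs only change at observed values, so checking those points is enough.
    for x in set(ref_sorted) | set(cur_sorted):
        f_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        f_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        d = max(d, abs(f_ref - f_cur))
    return d
```

One caveat that may be worth recording in NOTES.md: the KS test assumes continuous distributions. Binary metrics such as `pii_leakage`, and ratio scores with few distinct values, produce many ties, which tends to make the p-value conservative.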

eval/simulate_traffic.py ADDED

	@@ -0,0 +1,190 @@

+"""
+Populate telemetry with simulated traffic, then run drift detection.
+Two batches:
+  clean  — golden-dataset expected_answers (should match reference distribution)
+  dirty  — same questions, hallucinated responses (should show faithfulness drift)
+Bypasses the API entirely: runs graders + telemetry.record() directly.
+Usage:
+    cd /Users/praca/ai-response-validator && .venv/bin/python eval/simulate_traffic.py
+"""
+import sys
+import time
+from pathlib import Path
+import yaml
+sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
+import telemetry
+from config import CLIENT_DOMAIN
+from grader import GradeReport, grade
+DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
+KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
+# Hallucinated responses: plausible-sounding, but they contradict KB facts
+HALLUCINATED: dict[str, str] = {
+    # retail — NovaMart
+    "retail-nm-001": (
+        "When a product runs out of stock, the system automatically places a reorder after 72 hours "
+        "with no alerts sent to any manager. The supplier is notified only at month-end review."
+    ),
+    "retail-nm-002": (
+        "To add a new supplier, send an email to the procurement team with the company name. "
+        "No tax ID or payment terms are required at this stage. "
+        "Purchase orders can be created immediately without waiting for validation."
+    ),
+    "retail-nm-003": (
+        "Feature flags are permanent once enabled and cannot be disabled without a code deployment. "
+        "There is no expiry date or activation scope. Any employee can enable a flag in production."
+    ),
+    "retail-nm-004": (
+        "The authoritative source for product information is the pricing portal. "
+        "SKU records are updated manually once per week by the merchandising team. "
+        "Archived products can be reactivated instantly by any store manager."
+    ),
+    "retail-nm-005": (
+        "Price changes take effect immediately upon submission with no approval required. "
+        "There is no sync window; prices update in real time. "
+        "Emergency corrections are handled automatically without escalation."
+    ),
+    # retail — ShelfWise
+    "retail-sw-001": (
+        "An out-of-stock alert fires only after a manual stock check is initiated by a store manager. "
+        "The alert is sent exclusively to the regional director. "
+        "No escalation occurs if the alert is unacknowledged."
+    ),
+    "retail-sw-002": (
+        "Feature toggles are permanent once enabled. "
+        "There is no activation scope and no expiry date requirement. "
+        "Any user can enable toggles in production without sign-off."
+    ),
+    "retail-sw-004": (
+        "Compliance reports are editable for up to 30 days after creation and are stored for 2 years. "
+        "Any user can access compliance reports from the standard dashboard. "
+        "Reports are generated on demand only."
+    ),
+    "retail-sw-005": (
+        "Product catalog updates require manual approval for each SKU and can take up to 48 hours. "
+        "Deactivated products are permanently deleted and cannot be recovered."
+    ),
+    # pharma — ClinixOne
+    "pharma-cx-001": (
+        "Prior authorization is optional and payers respond within 7 business days. "
+        "Denied requests cannot be appealed and the prescriber must choose an alternative drug."
+    ),
+    "pharma-cx-003": (
+        "Adverse events must be reported to regulators within 30 days for all event types. "
+        "A safety signal is raised automatically by the system when 3 or more events occur. "
+        "Expected events do not require regulatory reporting."
+    ),
+    "pharma-cx-004": (
+        "Clinical trials have two phases: Phase I for safety and Phase II for market approval. "
+        "Enrollment eligibility is determined by the treating physician with no formal criteria."
+    ),
+    # pharma — PharmaLink
+    "pharma-pl-001": (
+        "Formulary pre-approval is automatically granted for all branded drugs. "
+        "The payer responds within 30 days and denied requests cannot be appealed."
+    ),
+    "pharma-pl-003": (
+        "The formulary has two tiers: generic and branded. "
+        "Moving a drug to a higher tier requires a 7-day notice to prescribers. "
+        "Tier assignment is reviewed every 5 years."
+    ),
+    "pharma-pl-004": (
+        "A prescribing pathway is a marketing document produced by pharmaceutical companies. "
+        "Pathways are reviewed every 5 years and payers do not use them in coverage decisions. "
+        "Deviation from a pathway requires no documentation."
+    ),
+    "pharma-pl-005": (
+        "Enrollment authorization is a formality — patients sign a standard waiver. "
+        "Consent is obtained after the first study procedure, not before. "
+        "Protocol changes do not require re-consent from existing participants."
+    ),
+}
+def _load_kb_context(domain: str) -> str:
+    path = KNOWLEDGE_ROOT / domain / "features.yaml"
+    data = yaml.safe_load(path.read_text())
+    chunks = [f"[{doc['title']}]\n{doc['content'].strip()}" for doc in data["documents"]]
+    return "\n\n".join(chunks)
+def _record(pair: dict, response: str, context: str, tag: str) -> GradeReport:
+    client = pair["client"]
+    report = grade(
+        query=pair["question"],
+        response=response,
+        context=context,
+        client=client,
+    )
+    telemetry.record(
+        client=client,
+        domain=pair["domain"],
+        query_len=len(pair["question"].split()),
+        latency_ms={"retrieve": 12.0, "generate": 180.0, "grade": 45.0},
+        report=report,
+        docs_retrieved=3,
+        min_retrieval_score=0.72,
+    )
+    status = "PASS" if report.overall else "FAIL"
+    faith = next(r for r in report.results if r.metric == "faithfulness")
+    print(f"  [{tag}] {pair['id']:<20} {status}  faith={faith.score:.3f}  {faith.detail}")
+    return report
+def run() -> None:
+    pairs = yaml.safe_load(DATASET_PATH.read_text())["pairs"]
+    kb: dict[str, str] = {}
+    # ── Batch 1: clean traffic ──────────────────────────────────────────────
+    print("\n── Batch 1: clean traffic (expected answers) ──\n")
+    for pair in pairs:
+        domain = pair["domain"]
+        if domain not in kb:
+            kb[domain] = _load_kb_context(domain)
+        response = pair["expected_answer"].strip()
+        _record(pair, response, kb[domain], "clean")
+        time.sleep(0.05)
+    # ── Batch 2: dirty traffic (hallucinated responses) ─────────────────────
+    print("\n── Batch 2: dirty traffic (hallucinated responses) ──\n")
+    dirty_pairs = [p for p in pairs if p["id"] in HALLUCINATED]
+    for pair in dirty_pairs:
+        domain = pair["domain"]
+        response = HALLUCINATED[pair["id"]]
+        _record(pair, response, kb[domain], "dirty")
+        time.sleep(0.05)
+    total = telemetry.live_stats()["total_queries"]
+    print(f"\nTelemetry buffer: {total} events ({len(pairs)} clean + {len(dirty_pairs)} dirty)\n")
+    # ── Drift detection ─────────────────────────────────────────────────────
+    print("=" * 60)
+    print("Running drift detection vs golden-dataset baseline...")
+    print("=" * 60)
+    sys.path.insert(0, str(Path(__file__).parent))
+    from drift import build_current, build_reference, detect_drift, report_drift
+    print("\nBuilding reference distribution...")
+    reference = build_reference()
+    current = build_current()
+    cur_n = len(next(iter(current.values()), []))
+    print(f"Reference: {len(next(iter(reference.values())))} pairs")
+    print(f"Current:   {cur_n} events\n")
+    results = detect_drift(current, reference)
+    report_drift(results)
+    print()
+if __name__ == "__main__":
+    run()
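Reviewer note: both this script and `drift.py` depend on an in-memory telemetry buffer (`telemetry.record()`, `telemetry._events`, `telemetry._lock`) that is not part of this diff. The sketch below shows one way such a buffer could look. It is an assumption for illustration only: the real `record()` takes a `report` and latency fields, and `_MAX_EVENTS` and `snapshot()` are hypothetical.

```python
import threading
from collections import deque
from typing import Any

_MAX_EVENTS = 1000  # hypothetical cap so the buffer cannot grow without bound
_lock = threading.Lock()
_events: deque[dict[str, Any]] = deque(maxlen=_MAX_EVENTS)


def record(client: str, domain: str, metrics: dict[str, float]) -> None:
    """Append one graded request to the buffer (simplified signature)."""
    event = {"client": client, "domain": domain, "metrics": dict(metrics)}
    with _lock:
        _events.append(event)


def snapshot() -> list[dict[str, Any]]:
    """Return a copy of the buffer so callers never iterate while writers append."""
    with _lock:
        return list(_events)
```

A public accessor like `snapshot()` would let `build_current()` avoid reaching into private names. Because the buffer lives in process memory, running `eval/drift.py` as a separate process would see an empty buffer, which is presumably why this script imports `drift` and runs detection in-process.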

tests/unit/test_drift.py ADDED

	@@ -0,0 +1,116 @@

+"""
+Unit tests for drift detection — detect_drift() only.
+No model loading, no IO, no telemetry.
+"""
+import sys
+from pathlib import Path
+import numpy as np
+import pytest
+sys.path.insert(0, str(Path(__file__).parent.parent.parent / "eval"))
+from drift import ALPHA, MIN_CURRENT_SAMPLES, MetricDrift, detect_drift
+METRICS = ["faithfulness", "answer_relevancy", "pii_leakage", "token_budget", "chain_terminology"]
+def _scores(n: int, **col_values: list[float]) -> dict[str, list[float]]:
+    """Build a Scores dict with fixed values per column; defaults to 0.9 for others."""
+    data: dict[str, list[float]] = {}
+    for metric in METRICS:
+        data[metric] = col_values.get(metric, [0.9] * n)
+    return data
+class TestDetectDrift:
+    def test_identical_distributions_no_drift(self) -> None:
+        rng = np.random.default_rng(42)
+        scores = rng.uniform(0.5, 1.0, 50).tolist()
+        ref = _scores(50, faithfulness=scores)
+        cur = _scores(50, faithfulness=scores)
+        results = detect_drift(cur, ref)
+        faith = next(r for r in results if r.metric == "faithfulness")
+        assert faith.drifted is False
+    def test_shifted_distribution_detected(self) -> None:
+        ref = _scores(50, faithfulness=[0.9] * 50)
+        cur = _scores(50, faithfulness=[0.1] * 50)
+        results = detect_drift(cur, ref)
+        faith = next(r for r in results if r.metric == "faithfulness")
+        assert faith.drifted is True
+        assert faith.p_value < ALPHA
+    def test_below_min_samples_excluded(self) -> None:
+        ref = _scores(50)
+        cur = _scores(MIN_CURRENT_SAMPLES - 1)
+        results = detect_drift(cur, ref)
+        assert results == []
+    def test_exactly_min_samples_included(self) -> None:
+        ref = _scores(50)
+        cur = _scores(MIN_CURRENT_SAMPLES)
+        results = detect_drift(cur, ref)
+        assert len(results) == len(METRICS)
+    def test_ks_statistic_in_range(self) -> None:
+        ref = _scores(50, faithfulness=[0.9] * 50)
+        cur = _scores(50, faithfulness=[0.1] * 50)
+        results = detect_drift(cur, ref)
+        faith = next(r for r in results if r.metric == "faithfulness")
+        assert 0.0 <= faith.ks_statistic <= 1.0
+    def test_means_computed_correctly(self) -> None:
+        ref = _scores(10, faithfulness=[0.8] * 10)
+        cur = _scores(10, faithfulness=[0.4] * 10)
+        results = detect_drift(cur, ref)
+        faith = next(r for r in results if r.metric == "faithfulness")
+        assert faith.ref_mean == pytest.approx(0.8, abs=1e-3)
+        assert faith.cur_mean == pytest.approx(0.4, abs=1e-3)
+    def test_all_metrics_returned(self) -> None:
+        ref = _scores(30)
+        cur = _scores(30)
+        result_names = {r.metric for r in detect_drift(cur, ref)}
+        assert result_names == set(METRICS)
+    def test_result_is_metric_drift_dataclass(self) -> None:
+        ref = _scores(20)
+        cur = _scores(20)
+        for r in detect_drift(cur, ref):
+            assert isinstance(r, MetricDrift)
+            assert isinstance(r.drifted, bool)
+            assert isinstance(r.ks_statistic, float)
+            assert isinstance(r.p_value, float)
+    def test_custom_alpha_respected(self) -> None:
+        rng = np.random.default_rng(0)
+        ref = _scores(50, faithfulness=rng.uniform(0.7, 1.0, 50).tolist())
+        cur = _scores(50, faithfulness=rng.uniform(0.4, 0.7, 50).tolist())
+        strict = detect_drift(cur, ref, alpha=0.001)
+        lenient = detect_drift(cur, ref, alpha=0.999)
+        faith_strict = next(r for r in strict if r.metric == "faithfulness")
+        faith_lenient = next(r for r in lenient if r.metric == "faithfulness")
+        assert faith_lenient.drifted or not faith_strict.drifted
+    def test_missing_metric_column_skipped(self) -> None:
+        ref: dict[str, list[float]] = {"faithfulness": [0.9] * 20}
+        cur: dict[str, list[float]] = {"faithfulness": [0.4] * 20}
+        results = detect_drift(cur, ref)
+        assert all(r.metric == "faithfulness" for r in results)
+        assert len(results) == 1
+    def test_empty_reference_skipped(self) -> None:
+        ref: dict[str, list[float]] = {"faithfulness": []}
+        cur: dict[str, list[float]] = {"faithfulness": [0.4] * 20}
+        results = detect_drift(cur, ref)
+        assert results == []
+    def test_sample_counts_in_result(self) -> None:
+        ref = _scores(30)
+        cur = _scores(10)
+        results = detect_drift(cur, ref)
+        for r in results:
+            assert r.ref_n == 30
+            assert r.cur_n == 10

tests/unit/test_grader.py CHANGED

@@ -11,10 +11,17 @@ import pytest
 sys.path.insert(0, str(Path(__file__).parent.parent.parent / "backend"))
+from unittest.mock import MagicMock, patch
+import numpy as np
 from grader import (
     grade_pii_leakage,
     grade_token_budget,
     grade_chain_terminology,
+    decompose_claims,
+    grade_faithfulness_decomposed,
+    FAITHFULNESS_THRESHOLD,
     TOKEN_BUDGET,
 )
@@ -138,3 +145,111 @@ class TestChainTerminology:
         )
         assert result.passed is False
         assert any(v["expected"] == "formulary pre-approval" for v in result.metadata["violations"])
+# ── decompose_claims ──────────────────────────────────────────────────────────
+class TestDecomposeClaims:
+    def test_single_sentence(self) -> None:
+        claims = decompose_claims("The product is in stock.")
+        assert claims == ["The product is in stock."]
+    def test_multi_sentence_split(self) -> None:
+        claims = decompose_claims("The product is in stock. It costs five dollars. Delivery takes two days.")
+        assert len(claims) == 3
+    def test_fragments_under_three_words_excluded(self) -> None:
+        claims = decompose_claims("Yes. The product is available in all sizes.")
+        assert all(len(c.split()) >= 3 for c in claims)
+    def test_exclamation_and_question_split(self) -> None:
+        claims = decompose_claims("Stock is low! Would you like to reorder? The threshold is five units.")
+        assert len(claims) == 3
+    def test_empty_string_returns_empty(self) -> None:
+        assert decompose_claims("") == []
+# ── grade_faithfulness_decomposed ────────────────────────────────────────────
+def _make_nli(entailment: float) -> MagicMock:
+    """Mock CrossEncoder whose predict() always returns the given entailment score."""
+    mock = MagicMock()
+    # columns: [contradiction, entailment, neutral]
+    mock.predict = MagicMock(
+        side_effect=lambda pairs, **kw: np.array([[0.1, entailment, 0.0]] * len(pairs))
+    )
+    return mock
+CONTEXT = "The product costs five dollars.\n\nDelivery takes two days."
+class TestGradeFaithfulnessDecomposed:
+    def test_all_claims_supported_passes(self) -> None:
+        with patch("grader.get_nli_model", return_value=_make_nli(0.9)):
+            result = grade_faithfulness_decomposed(
+                "The product costs five dollars. Delivery takes two days.", CONTEXT
+            )
+        assert result.passed is True
+        assert result.score == 1.0
+        assert result.metadata["claims"][0]["supported"] is True
+    def test_all_claims_unsupported_fails(self) -> None:
+        with patch("grader.get_nli_model", return_value=_make_nli(0.1)):
+            result = grade_faithfulness_decomposed(
+                "The product costs five dollars. Delivery takes two days.", CONTEXT
+            )
+        assert result.passed is False
+        assert result.score == 0.0
+    def test_partial_hallucination_detected(self) -> None:
+        # first claim supported, second not — whole-response NLI would miss this
+        call_count = 0
+        def side_effect(pairs: list, **kw: object) -> np.ndarray:
+            nonlocal call_count
+            call_count += 1
+            entailment = 0.9 if call_count == 1 else 0.1
+            return np.array([[0.1, entailment, 0.0]] * len(pairs))
+        mock_model = MagicMock()
+        mock_model.predict = MagicMock(side_effect=side_effect)
+        with patch("grader.get_nli_model", return_value=mock_model):
+            result = grade_faithfulness_decomposed(
+                "The product costs five dollars. It was invented in 1842.", CONTEXT
+            )
+        assert result.score == 0.5
+        assert result.metadata["claims"][0]["supported"] is True
+        assert result.metadata["claims"][1]["supported"] is False
+    def test_refusal_auto_passes(self) -> None:
+        result = grade_faithfulness_decomposed(
+            "I don't have enough information to answer that.", CONTEXT
+        )
+        assert result.passed is True
+        assert result.score == 1.0
+    def test_empty_context_fails(self) -> None:
+        with patch("grader.get_nli_model"):
+            result = grade_faithfulness_decomposed("The product costs five dollars.", "")
+        assert result.passed is False
+        assert result.score == 0.0
+    def test_metadata_shape(self) -> None:
+        with patch("grader.get_nli_model", return_value=_make_nli(0.8)):
+            result = grade_faithfulness_decomposed(
+                "The product is available. It ships in two days.", CONTEXT
+            )
+        for entry in result.metadata["claims"]:
+            assert "claim" in entry
+            assert "score" in entry
+            assert "supported" in entry
+    def test_score_is_proportion_not_max(self) -> None:
+        """Verify score = supported/total, not max(entailment_scores)."""
+        with patch("grader.get_nli_model", return_value=_make_nli(0.9)):
+            result = grade_faithfulness_decomposed(
+                "Claim one is true. Claim two is also true. Claim three too.", CONTEXT
+            )
+        assert result.score == 1.0
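Reviewer note: the body of `decompose_claims()` is not visible in this part of the diff, but the `TestDecomposeClaims` cases pin down its contract: split on `.`, `!`, and `?`, drop fragments under three words, and return an empty list for empty input. A minimal sketch that would satisfy those cases follows. The name `split_claims` and the regex are assumptions, not the project's implementation.

```python
import re

# Split after sentence-ending punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_MIN_WORDS = 3  # fragments shorter than this (e.g. "Yes.") carry no checkable claim


def split_claims(text: str) -> list[str]:
    """Split a response into sentence-level claims, dropping very short fragments."""
    parts = (p.strip() for p in _SENTENCE_END.split(text.strip()))
    return [p for p in parts if len(p.split()) >= _MIN_WORDS]
```

A regex splitter like this would also break on abbreviations such as "Dr. Smith", so a test case covering abbreviations may be worth adding if the knowledge bases contain them.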