mbochniak01 Claude Sonnet 4.6 committed on
Commit ffbf46f · 1 Parent(s): e181667

Replace HHEM with sentence-level NLI, add claim decomposition and drift detection

Faithfulness grader:
- Replace broken Vectara HHEM v2 (zero embedding matrix) with cross-encoder/nli-deberta-v3-small
- Add decompose_claims(): splits response into atomic sentences for per-claim verification
- Add _context_sentences(): splits KB chunks into sentences before NLI scoring; fixes
paragraph-level entailment collapse (verbatim text was scoring ent=0.002 at paragraph level,
ent=0.995 at sentence level including aliased terms like "item registry" vs "product catalog")
- grade_faithfulness_decomposed() promoted to default in grade(); score = supported/total claims

Drift detection:
- eval/drift.py: KS two-sample test per metric vs golden-dataset baseline
- eval/compare_faithfulness.py: side-by-side whole-response vs claim-level scores
- eval/simulate_traffic.py: clean + hallucinated traffic simulation for drift testing
- tests/unit/test_drift.py: 12 unit tests for detect_drift()

Docs updated to reflect all changes (ARCHITECTURE.md, NOTES.md, README.md)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ARCHITECTURE.md CHANGED
@@ -60,7 +60,7 @@ Runs inline with every request. No ground truth required.
60
  | `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate - fails hard |
61
  | `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
62
  | `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
63
- | `faithfulness` | Vectara HHEM v2 cross-encoder | ≥ 0.35 | Hallucination detection |
64
  | `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
65
 
66
  ### L2 - Batch (local, against golden dataset)
@@ -68,10 +68,14 @@ Runs inline with every request. No ground truth required.
68
  ```bash
69
  python eval/metrics.py --domain retail
70
  python eval/metrics.py --client novamart --out results.json
 
 
 
71
  ```
72
 
73
- Runs all 20 golden pairs through the full pipeline. Adds keyphrase coverage scoring
74
- on top of L1 metrics to verify factual completeness against reference answers.
 
75
 
76
  ---
77
 
@@ -85,25 +89,25 @@ Two fundamentally different model architectures serve different roles in this sy
85
  |---|---|---|
86
  | **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
87
  | **Speed** | Fast - embeddings pre-computed at index build time | Slow - must re-encode every (query, doc) pair at inference |
88
- | **Quality** | Good for retrieval: finds semantically similar docs | Better for re-ranking or NLI: captures fine-grained entailment |
89
- | **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (Vectara HHEM v2) |
90
 
91
  **Measured overhead (CPU, HF Spaces):**
92
 
93
  | Step | Model | Typical latency |
94
  |------|-------|----------------|
95
  | Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10–15 ms |
96
- | KB cosine search (1,346 docs) | numpy matrix multiply | ~2 ms |
97
  | Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
98
- | Faithfulness (3 chunk pairs) | cross-encoder (Vectara HHEM v2) | ~300–600 ms |
99
- | Total grading overhead | - | ~350–650 ms |
100
 
101
  **Why bi-encoder for retrieval:** query time is constant regardless of KB size because
102
  document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
103
  query latency - only index build time grows.
104
 
105
  **Why cross-encoder for faithfulness:** cross-encoders see both the document and the
106
- response simultaneously, capturing entailment relationships bi-encoders miss. A response
107
  can be semantically similar to a document (high cosine) while still hallucinating specific
108
  facts - the cross-encoder catches this, the bi-encoder does not.
109
 
@@ -124,22 +128,23 @@ It flags rival-client terms appearing without the correct client term.
124
  **Why this matters:** in production multi-tenant AI systems, terminology leakage
125
  between clients is a real failure mode. This catches it mechanically.
126
 
127
- ### Faithfulness via Vectara HHEM v2
128
 
129
- The faithfulness grader uses [Vectara's Hallucination Evaluation Model](https://huggingface.co/vectara/hallucination_evaluation_model),
130
- a cross-encoder fine-tuned specifically for RAG faithfulness (not general NLI entailment).
131
- It scores `(document_chunk, response)` pairs and returns a probability in [0, 1] that
132
- the response is factually consistent with the document.
133
 
134
- **Why not Claude-as-judge:** adds API cost and latency per query; non-deterministic;
135
- requires prompt engineering to produce consistent scores. A purpose-built cross-encoder
136
- is faster, cheaper, and more consistent for this specific task.
137
 
138
- **Why not generic NLI (DeBERTa):** general NLI models are trained on textual entailment
139
- benchmarks, not RAG faithfulness. They score whether a hypothesis follows logically from
140
- a premise - a different task. Correct, grounded answers score near zero on NLI entailment,
141
- causing false positives. HHEM v2 is trained on (document, response) pairs from real RAG
142
- systems, which maps directly to this use case.
 
 
 
 
143
 
144
  ### In-memory semantic retrieval
145
 
@@ -193,8 +198,12 @@ knowledge/
193
  features.yaml KB documents for retrieval
194
 
195
  eval/
196
- golden-dataset.yaml 20 Q&A pairs (10 retail, 10 pharma) for L2 evaluation
197
- metrics.py L2 batch runner - CLI, keyphrase scoring, HTML report
 
 
 
 
198
 
199
  ui/
200
  index.html Chat interface + eval panel
@@ -207,7 +216,10 @@ ui/
207
 
208
  | Decision | Alternative | Why this |
209
  |----------|-------------|----------|
210
- | Vectara HHEM v2 for faithfulness | Claude-as-judge / DeBERTa NLI | Purpose-built for RAG faithfulness; no API cost; deterministic |
 
 
 
211
  | In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
212
  | Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
213
  | Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
@@ -219,10 +231,35 @@ ui/
219
 
220
  ## Evaluation coverage vs RAGAS
221
 
222
- | RAGAS metric | Coverage |
223
- |---|---|
224
- | faithfulness | ✓ L1 (Claude judge) |
225
- | answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) |
226
- | context_precision | partial - retrieval score visible in UI |
227
- | context_recall | ✓ L2 (keyphrase coverage) |
228
- | answer_correctness | ✓ L2 (keyphrase + expected_answer) |
60
  | `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate - fails hard |
61
  | `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
62
  | `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
63
+ | `faithfulness` | Claim decomposition + sentence-level NLI | ≥ 0.35 (proportion) | Hallucination detection, claim-level granularity |
64
  | `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
65
 
66
  ### L2 - Batch (local, against golden dataset)
 
68
  ```bash
69
  python eval/metrics.py --domain retail
70
  python eval/metrics.py --client novamart --out results.json
71
+ python eval/calibrate.py # threshold distribution on golden answers
72
+ python eval/compare_faithfulness.py # whole-response vs claim-level side-by-side
73
+ python eval/drift.py # KS drift detection vs golden baseline
74
  ```
75
 
76
+ Runs golden pairs through the full pipeline. Adds keyphrase coverage scoring on top of
77
+ L1 metrics to verify factual completeness against reference answers. `drift.py` compares
78
+ live telemetry score distributions against the golden baseline using KS two-sample tests.
79
 
80
  ---
81
 
 
89
  |---|---|---|
90
  | **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
91
  | **Speed** | Fast - embeddings pre-computed at index build time | Slow - must re-encode every (query, doc) pair at inference |
92
+ | **Quality** | Good for retrieval: finds semantically similar docs | Better for NLI: captures fine-grained entailment between short sequences |
93
+ | **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (`nli-deberta-v3-small`) |
94
 
95
  **Measured overhead (CPU, HF Spaces):**
96
 
97
  | Step | Model | Typical latency |
98
  |------|-------|----------------|
99
  | Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10–15 ms |
100
+ | KB cosine search | numpy matrix multiply | ~2 ms |
101
  | Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
102
+ | Faithfulness (N claim × M sentence pairs) | cross-encoder NLI | ~200–500 ms |
103
+ | Total grading overhead | - | ~250–550 ms |
104
 
105
  **Why bi-encoder for retrieval:** query time is constant regardless of KB size because
106
  document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
107
  query latency - only index build time grows.
108
 
109
  **Why cross-encoder for faithfulness:** cross-encoders see both the document and the
110
+ claim simultaneously, capturing entailment relationships bi-encoders miss. A response
111
  can be semantically similar to a document (high cosine) while still hallucinating specific
112
  facts - the cross-encoder catches this, the bi-encoder does not.
113
 
 
128
  **Why this matters:** in production multi-tenant AI systems, terminology leakage
129
  between clients is a real failure mode. This catches it mechanically.
130
 
131
+ ### Faithfulness: claim decomposition + sentence-level NLI
132
 
133
+ The faithfulness grader (`grade_faithfulness_decomposed`) uses a three-step pipeline:
 
 
 
134
 
135
+ 1. **Claim decomposition**: the response is split into individual sentences via regex. Each sentence is an atomic claim to verify independently.
136
+ 2. **Context sentence splitting**: KB chunks are split into individual sentences before scoring. NLI cross-encoders are calibrated on sentence-pair inputs (SNLI/MNLI training format); paragraph-level inputs degrade performance significantly (verbatim text scores near 0.002 entailment when the context is a 3–4 sentence paragraph).
137
+ 3. **Per-claim NLI scoring**: each claim is scored against every context sentence. A claim is "supported" if its max entailment ≥ threshold. Score = supported_claims / total_claims.
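
The scoring step can be sketched without loading a model. This is a minimal illustration, not the project's `grade_faithfulness_decomposed()`: `score_claims`, `toy_entail`, and the example sentences are hypothetical, the scorer is a stand-in for the cross-encoder, and the 0.35 default is assumed from the L1 metrics table.

```python
from typing import Callable

# Hypothetical scorer signature: (premise, hypothesis) -> entailment probability in [0, 1].
EntailFn = Callable[[str, str], float]


def score_claims(
    claims: list[str],
    premises: list[str],
    entail: EntailFn,
    threshold: float = 0.35,  # assumption: same value as the L1 faithfulness threshold
) -> tuple[float, list[dict]]:
    """Return (supported / total, per-claim details) for already-split sentences."""
    if not claims or not premises:
        return 0.0, []
    details = []
    for claim in claims:
        # A claim counts as supported if any single context sentence entails it.
        best = max(entail(premise, claim) for premise in premises)
        details.append({"claim": claim, "score": best, "supported": best >= threshold})
    supported = sum(1 for d in details if d["supported"])
    return supported / len(details), details


def toy_entail(premise: str, hypothesis: str) -> float:
    """Stand-in for the NLI model: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if premise.strip().lower() == hypothesis.strip().lower() else 0.0


premises = [
    "Returns are accepted within 30 days.",
    "Refunds go to the original payment method.",
]
claims = [
    "Returns are accepted within 30 days.",
    "Refunds are paid in store credit only.",
]
score, details = score_claims(claims, premises, toy_entail)
print(score)  # 0.5: one of the two claims is supported
```

Taking the maximum over context sentences means a claim needs only one supporting sentence, while the proportion over claims keeps a single fabricated sentence visible in the per-claim details.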
138
 
139
+ **Model:** `cross-encoder/nli-deberta-v3-small` - 3-class NLI (contradiction / entailment / neutral). Entailment column used. Sentence-level inputs give `ent ≥ 0.98` for verbatim and aliased claims ("item registry" vs "product catalog (item registry)").
140
+
141
+ **Why claim-level not whole-response NLI:** whole-response NLI misses partial hallucinations. A 4-sentence response with 3 correct sentences and 1 fabricated one scores high because the model finds one well-grounded sentence. Claim-level scores 3/4 = 0.75 and exposes the fabrication in metadata.
142
+
143
+ **Why sentence-level context not paragraph-level:** NLI cross-encoders are trained on single (premise, hypothesis) sentence pairs. Feeding a paragraph as premise causes entailment scores to collapse - the model distributes probability mass across longer sequences in ways not seen during training. Sentence-level splitting resolves alias mismatches too: `"pricing sync"` vs `"Price updates (pricing syncs) must be submitted..."` scores `ent=0.986` at sentence level.
144
+
145
+ **Why not Claude-as-judge for L1:** adds API cost and latency per query; non-deterministic. The cross-encoder handles L1; LLM-as-judge belongs in L2 batch evaluation for authoritative ground-truth comparison.
146
+
147
+ **Why not Vectara HHEM v2:** the HHEM v2 checkpoint is missing `t5.transformer.encoder.embed_tokens.weight`, so the embedding matrix is zero-initialized and the model produces a constant 0.502 probability for every input regardless of content. Diagnosed via `embed_tokens.std() == 0.0`.
148
 
149
  ### In-memory semantic retrieval
150
 
 
198
  features.yaml KB documents for retrieval
199
 
200
  eval/
201
+ golden-dataset.yaml 24 Q&A pairs (20 standard + 4 adversarial edge cases)
202
+ metrics.py L2 batch runner - CLI, keyphrase scoring, HTML report
203
+ calibrate.py Threshold calibration - score distributions on golden answers
204
+ compare_faithfulness.py Side-by-side: whole-response vs claim-level faithfulness scores
205
+ drift.py KS drift detection - live telemetry vs golden baseline
206
+ simulate_traffic.py Populate telemetry with clean + hallucinated traffic for drift testing
207
 
208
  ui/
209
  index.html Chat interface + eval panel
 
216
 
217
  | Decision | Alternative | Why this |
218
  |----------|-------------|----------|
219
+ | `nli-deberta-v3-small` + sentence splitting | Vectara HHEM v2 / Claude-as-judge | HHEM broken (zero embeddings); DeBERTa works at sentence level; no API cost |
220
+ | Claim-level faithfulness (proportion) | Whole-response NLI | Whole-response misses partial hallucinations; claim-level exposes them in metadata |
221
+ | Sentence-level context splitting | Full paragraph as NLI premise | NLI models calibrated on sentence pairs; paragraph inputs collapse entailment scores |
222
+ | KS two-sample test for drift | Evidently DataDriftPreset | Same statistical test, no extra dependency (scipy via scikit-learn) |
223
  | In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
224
  | Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
225
  | Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
 
231
 
232
  ## Evaluation coverage vs RAGAS
233
 
234
+ | RAGAS metric | Coverage | Notes |
235
+ |---|---|---|
236
+ | faithfulness | ✓ L1 (claim-level NLI) | `grade_faithfulness_decomposed()` - sentence-level cross-encoder |
237
+ | answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) | Bi-encoder cosine; LLM-based in L2 |
238
+ | context_precision | partial - retrieval score in UI | No rank-weighted precision@k |
239
+ | context_recall | ✓ L2 (keyphrase coverage) | Keyphrases as proxy for claim coverage |
240
+ | answer_correctness | ✓ L2 (keyphrase + expected_answer) | |
241
+
242
+ ## Drift detection
243
+
244
+ `eval/drift.py` detects distribution shift in grader scores between live traffic and the golden-dataset baseline.
245
+
246
+ ```
247
+ reference = build_reference() # run all graders on golden-dataset expected_answers
248
+ current = build_current() # pull metric scores from telemetry._events
249
+
250
+ results = detect_drift(current, reference, alpha=0.05)
251
+ # → per-metric: ks_statistic, p_value, drifted, ref_mean, cur_mean, delta
252
+ ```
253
+
254
+ **Statistical test:** KS two-sample (Kolmogorov–Smirnov). Same test as Evidently `DataDriftPreset` for numerical columns. Detects any shift in distribution shape, not just mean change.
255
+
256
+ **Sensitivity:** with n_ref=24 golden pairs, KS test reaches p < 0.05 at ~40% traffic degradation (n_cur=40+). Smaller effects require larger current sample windows.
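
For intuition, the two-sample KS statistic and an approximate p-value can be computed with the standard library alone. This is a sketch under stated assumptions, not the code in `eval/drift.py` (which the decision table says relies on scipy): `ks_statistic`, `ks_pvalue`, and `ks_drift` are hypothetical names, the p-value uses the asymptotic Kolmogorov series with a common small-sample correction, and the sample scores are made up.

```python
import math


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d: float, n: int, m: int) -> float:
    """Asymptotic two-sided p-value (approximation; exact methods differ for small samples)."""
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return 1.0  # the series below converges slowly here and the value is ~1
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(max(total, 0.0), 1.0)


def ks_drift(reference: list[float], current: list[float], alpha: float = 0.05) -> dict:
    """Hypothetical per-metric check: flag drift when the p-value falls below alpha."""
    d = ks_statistic(reference, current)
    p = ks_pvalue(d, len(reference), len(current))
    return {"ks_statistic": d, "p_value": p, "drifted": p < alpha}


reference = [0.9] * 24   # made-up faithfulness scores on golden answers
degraded = [0.1] * 40    # made-up scores from fully degraded traffic
print(ks_drift(reference, degraded)["drifted"])   # True
print(ks_drift(reference, reference)["drifted"])  # False
```

Because the statistic compares whole empirical distributions, it reacts to changes in spread or shape as well as shifts in the mean.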
257
+
258
+ **What each metric's drift signals:**
259
+
260
+ | Metric | Drift means |
261
+ |--------|-------------|
262
+ | `faithfulness` | Model hallucinating more / KB stale / retrieval returning wrong docs |
263
+ | `answer_relevancy` | Query distribution shifted / model off-topic |
264
+ | `chain_terminology` | Terminology catalog misaligned with model outputs |
265
+ | `pii_leakage` / `token_budget` | Structural output format changed |
NOTES.md CHANGED
@@ -43,6 +43,34 @@ teardown fixtures.
43
 
44
  ---
45
 
46
  ## Alternative judge approaches considered
47
 
48
  ### Ollama (local LLM judge)
@@ -59,12 +87,12 @@ dependency on HF token entirely and allow offline eval runs.
59
  in a structured format designed for rubric-based grading. It's a drop-in replacement
60
  for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
61
  for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
62
- The tradeoff vs. the current Vectara HHEM v2 approach: Prometheus is slower (7B vs
63
  purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
64
  which is more interpretable for audit and debugging.
65
 
66
- **Why not used here:** HHEM v2 runs faster and requires no prompt engineering.
67
- Prometheus would be the right choice if rationale logging is a compliance requirement.
68
 
69
  ---
70
 
@@ -88,14 +116,18 @@ Prometheus would be the right choice if rationale logging is a compliance requir
88
 
89
  - **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
90
  enough information" responses and returns score=1.0 - no factual claims, trivially faithful.
91
- - **Partial grounding blind spot**: faithfulness now uses sentence-level min-score (weakest
92
- link wins) instead of max-score across chunks. A response with one hallucinated sentence
93
- now fails even if other sentences are grounded.
 
94
  - **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING
95
  log entry and sets `flagged: true` in the response payload. UI shows a red banner.
96
  - **Cold-start latency**: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
97
  - **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query, rival-term
98
  prompt injection, multi-doc synthesis, hallucination bait).
 
 
 
99
 
100
  ---
101
 
 
43
 
44
  ---
45
 
46
+ ## NLI model selection - what was tried and why
47
+
48
+ The faithfulness grader went through three configurations (two models) before converging:
49
+
50
+ **Vectara HHEM v2** (`vectara/hallucination_evaluation_model`) - purpose-built for RAG
51
+ faithfulness, not general NLI. The correct model for this task. Unusable: the checkpoint
52
+ is missing `t5.transformer.encoder.embed_tokens.weight`. The embedding matrix is
53
+ zero-initialized (`std=0.0`), producing constant 0.502 probability for every input.
54
+ Diagnosed via weight inspection, not error message.
55
+
56
+ **`cross-encoder/nli-deberta-v3-small`** (first attempt, paragraph-level) - 3-class NLI
57
+ (contradiction / entailment / neutral). Correct model family, wrong input format.
58
+ NLI cross-encoders are trained on sentence-pair inputs (SNLI/MNLI). Feeding a 3–4
59
+ sentence KB paragraph as the premise causes entailment scores to collapse - verbatim
60
+ text scores `ent=0.002`, treated as neutral. Root cause: model distributes probability
61
+ across longer sequences in ways not seen during training.
62
+
63
+ **`cross-encoder/nli-deberta-v3-small` (sentence-level)** - same model, fixed by splitting
64
+ KB chunks into individual sentences before scoring. Verbatim: `ent=0.995`. Aliased terms
65
+ ("item registry" vs "product catalog (item registry)"): `ent=0.989`. Hallucinated facts:
66
+ `ent≈0.000`, contradiction≈1.0. This is the current implementation.
67
+
68
+ **Key insight:** the NLI model selection problem is a data format problem as much as a
69
+ model selection problem. The same model produces correct results at sentence level and
70
+ degenerate results at paragraph level.
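
To make the format change concrete, here is a small sketch of what sentence-level splitting does to a paragraph. It reuses the `_SENTENCE_SPLIT` pattern and the three-word minimum from `backend/grader.py`; `split_sentences` and the sample text are illustrative assumptions.

```python
import re

# Same pattern as _SENTENCE_SPLIT in backend/grader.py.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str, min_words: int = 3) -> list[str]:
    """Split after sentence-ending punctuation and drop fragments under min_words."""
    parts = SENTENCE_SPLIT.split(text.strip())
    return [p.strip() for p in parts if len(p.split()) >= min_words]


paragraph = (
    "Price updates must be submitted before noon. "
    "Late submissions are rejected! See policy. "
    "The fee is 3.5 percent of the order."
)
sentences = split_sentences(paragraph)
for sentence in sentences:
    print(sentence)
# Price updates must be submitted before noon.
# Late submissions are rejected!
# The fee is 3.5 percent of the order.

# Limitation of this pattern: an abbreviation followed by a space splits one claim in two.
fragments = split_sentences("Use a SKU e.g. A12 for item lookups.")
print(fragments)  # ['Use a SKU e.g.', 'A12 for item lookups.']
```

The decimal in "3.5" survives because the pattern requires whitespace after the punctuation, while "See policy." is dropped by the word-count filter. The abbreviation case suggests a tokenizer-based splitter may be worth considering if KB text contains many abbreviations.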
71
+
72
+ ---
73
+
74
  ## Alternative judge approaches considered
75
 
76
  ### Ollama (local LLM judge)
 
87
  in a structured format designed for rubric-based grading. It's a drop-in replacement
88
  for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
89
  for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
90
+ The tradeoff vs. the current sentence-level NLI approach: Prometheus is slower (7B vs
91
  purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
92
  which is more interpretable for audit and debugging.
93
 
94
+ **Why not used here:** the cross-encoder NLI approach runs faster and requires no prompt
95
+ engineering. Prometheus would be the right choice if rationale logging is a compliance requirement.
96
 
97
  ---
98
 
 
116
 
117
  - **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
118
  enough information" responses and returns score=1.0 - no factual claims, trivially faithful.
119
+ - **Partial grounding blind spot**: faithfulness now uses claim-level decomposition
120
+ (`grade_faithfulness_decomposed`). Response split into sentences; each verified
121
+ independently. Score = supported_claims / total_claims. A response with one hallucinated
122
+ sentence in three now scores 0.667, not 1.0.
123
  - **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING
124
  log entry and sets `flagged: true` in the response payload. UI shows a red banner.
125
  - **Cold-start latency**: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
126
  - **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query, rival-term
127
  prompt injection, multi-doc synthesis, hallucination bait).
128
+ - **No drift detection**: added `eval/drift.py` - KS two-sample test per metric, compares
129
+ live telemetry scores against golden-dataset baseline. Detects faithfulness degradation
130
+ at p < 0.05 with ~40% traffic degradation across 40+ events.
131
 
132
  ---
133
 
README.md CHANGED
@@ -57,13 +57,23 @@ All tests are stateless - no cleanup required.
57
  ## Batch evaluation (L2)
58
 
59
  ```bash
60
- make eval-retail # evaluate 10 retail Q&A pairs, open HTML report
61
- make eval-pharma # evaluate 10 pharma Q&A pairs, open HTML report
62
- make eval # all 20 pairs
63
  ```
64
 
65
  Reports are written to `eval/reports/`.
66
 
 
 
 
 
 
 
 
 
 
 
67
  ---
68
 
69
  ## Code quality
@@ -133,8 +143,9 @@ See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency
133
  | PII Leakage | L1 live | Regex scan - binary |
134
  | Token Budget | L1 live | Char count ÷ 4 |
135
  | Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
136
- | Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
137
  | Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
138
  | Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
 
139
 
140
  **Core principle:** no single metric proves correctness. The combination does.
 
57
  ## Batch evaluation (L2)
58
 
59
  ```bash
60
+ make eval-retail # evaluate retail Q&A pairs, open HTML report
61
+ make eval-pharma # evaluate pharma Q&A pairs, open HTML report
62
+ make eval # all domains
63
  ```
64
 
65
  Reports are written to `eval/reports/`.
66
 
67
+ **Drift detection** (no server required):
68
+
69
+ ```bash
70
+ python eval/simulate_traffic.py # populate telemetry + run drift report
71
+ python eval/drift.py # drift report against live telemetry
72
+ ```
73
+
74
+ Compares live grader score distributions against the golden-dataset baseline using KS tests.
75
+ Detects faithfulness degradation from model updates, KB staleness, or query distribution shift.
76
+
77
  ---
78
 
79
  ## Code quality
 
143
  | PII Leakage | L1 live | Regex scan - binary |
144
  | Token Budget | L1 live | Char count ÷ 4 |
145
  | Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
146
+ | Faithfulness | L1 live | Claim decomposition + sentence-level NLI cross-encoder |
147
  | Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
148
  | Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
149
+ | Drift Detection | L2 offline | KS two-sample test vs golden-dataset baseline |
150
 
151
  **Core principle:** no single metric proves correctness. The combination does.
backend/grader.py CHANGED
@@ -5,7 +5,7 @@ Metrics:
5
  pii_leakage - regex scan for PII patterns in response
6
  token_budget - response within allowed token ceiling
7
  answer_relevancy - cosine similarity between query and response embeddings
8
- faithfulness - Vectara HHEM v2: RAG faithfulness probability per (doc, response) pair
9
  chain_terminology - deterministic: client-specific terms used (via RosettaStone)
10
  """
11
 
@@ -14,19 +14,20 @@ import re
14
  from dataclasses import dataclass, field
15
  from typing import Any
16
 
 
17
  from config import EMBEDDER_MODEL
18
  from rosetta import check_terminology
19
- from sentence_transformers import SentenceTransformer
20
  from sklearn.metrics.pairwise import cosine_similarity
21
- from transformers import T5Tokenizer
22
- from transformers import pipeline as hf_pipeline
23
 
24
  log = logging.getLogger(__name__)
25
 
26
  _embedder: SentenceTransformer | None = None
27
- _nli_model: Any = None
28
 
29
- NLI_MODEL = "vectara/hallucination_evaluation_model"
 
 
30
 
31
 
32
  def get_embedder() -> SentenceTransformer:
@@ -37,31 +38,11 @@ def get_embedder() -> SentenceTransformer:
37
  return _embedder
38
 
39
 
40
- def get_nli_model() -> Any:
41
- """Return the shared Vectara faithfulness pipeline, loading it on first call."""
42
  global _nli_model
43
  if _nli_model is None:
44
- # HHEMv2 doesn't call post_init() in __init__, so all_tied_weights_keys is never
45
- # set - transformers 5.x requires it in _finalize_model_loading. Patch before load.
46
- from transformers import PreTrainedModel
47
- _orig = PreTrainedModel.mark_tied_weights_as_initialized
48
- def _patched(self: Any, loading_info: Any) -> None:
49
- if not hasattr(self, "all_tied_weights_keys"):
50
- self.all_tied_weights_keys = {}
51
- _orig(self, loading_info) # type: ignore[no-untyped-call]
52
- PreTrainedModel.mark_tied_weights_as_initialized = _patched # type: ignore[method-assign]
53
-
54
- tokenizer = T5Tokenizer.from_pretrained("t5-small")
55
- _nli_model = hf_pipeline(
56
- "text-classification",
57
- model=NLI_MODEL,
58
- tokenizer=tokenizer,
59
- trust_remote_code=True,
60
- truncation=True,
61
- max_length=512,
62
- )
63
-
64
- PreTrainedModel.mark_tied_weights_as_initialized = _orig # type: ignore[method-assign]
65
  return _nli_model
66
 
67
 
@@ -95,6 +76,8 @@ class GradeReport:
95
  }
96
 
97
 
 
 
98
  _PII_PATTERNS = [
99
  (r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
100
  (r"\b\d{16}\b", "credit card"),
@@ -163,8 +146,28 @@ def _strip_chunk_title(chunk: str) -> str:
163
  return chunk
164
 
165
 
166
  def grade_faithfulness(response: str, context: str) -> GradeResult:
167
- """Faithfulness via Vectara hallucination model: scores (document, response) pairs directly."""
168
  if _is_refusal(response):
169
  return GradeResult(
170
  metric="faithfulness", passed=True, score=1.0,
@@ -175,24 +178,63 @@ def grade_faithfulness(response: str, context: str) -> GradeResult:
175
  if not raw_chunks:
176
  return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
177
  chunks = [_strip_chunk_title(c) for c in raw_chunks]
178
- # text_pair encodes sequences with T5 </s> separator - correct for T5-based models.
179
- pairs = [{"text": chunk, "text_pair": response} for chunk in chunks]
180
- results = model(pairs)
181
- log.info("Vectara raw: %s", [(r["label"], round(r["score"], 3)) for r in results])
182
- scores = [
183
- r["score"] if r["label"].lower().startswith("factually consistent") else 1.0 - r["score"]
184
- for r in results
185
- ]
186
- score = float(max(scores))
187
- passed = score >= FAITHFULNESS_THRESHOLD
188
  return GradeResult(
189
  metric="faithfulness",
190
- passed=passed,
191
  score=score,
192
  detail=f"Faithfulness {score:.3f} (threshold: {FAITHFULNESS_THRESHOLD})",
193
  )
194
 
195
 
196
  def grade_chain_terminology(response: str, client: str) -> GradeResult:
197
  """Check that the response uses client-specific terms, not rival terminology."""
198
  result = check_terminology(response, client)
@@ -226,7 +268,7 @@ def grade(
226
  grade_pii_leakage(response),
227
  grade_token_budget(response, token_budget),
228
  grade_answer_relevancy(query, response),
229
- grade_faithfulness(response, context),
230
  grade_chain_terminology(response, client),
231
  ]
232
  return report
 
5
  pii_leakage - regex scan for PII patterns in response
6
  token_budget - response within allowed token ceiling
7
  answer_relevancy - cosine similarity between query and response embeddings
8
+ faithfulness - NLI cross-encoder: entailment score per (chunk, claim) pair
9
  chain_terminology - deterministic: client-specific terms used (via RosettaStone)
10
  """
11
 
 
14
  from dataclasses import dataclass, field
15
  from typing import Any
16
 
17
+ import numpy as np
18
  from config import EMBEDDER_MODEL
19
  from rosetta import check_terminology
20
+ from sentence_transformers import CrossEncoder, SentenceTransformer
21
  from sklearn.metrics.pairwise import cosine_similarity
 
 
22
 
23
  log = logging.getLogger(__name__)
24
 
25
  _embedder: SentenceTransformer | None = None
26
+ _nli_model: CrossEncoder | None = None
27
 
28
+ # cross-encoder/nli-deberta-v3-small: 3-class NLI, columns = [contradiction, entailment, neutral]
29
+ NLI_MODEL = "cross-encoder/nli-deberta-v3-small"
30
+ _NLI_ENTAILMENT_IDX = 1
31
 
32
 
33
  def get_embedder() -> SentenceTransformer:
 
38
  return _embedder
39
 
40
 
41
+ def get_nli_model() -> CrossEncoder:
42
+ """Return the shared NLI cross-encoder, loading it on first call."""
43
  global _nli_model
44
  if _nli_model is None:
45
+ _nli_model = CrossEncoder(NLI_MODEL)
46
  return _nli_model
47
 
48
 
 
76
  }
77
 
78
 
79
+ _SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
80
+
81
  _PII_PATTERNS = [
82
  (r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
83
  (r"\b\d{16}\b", "credit card"),
 
146
  return chunk
147
 
148
 
+ def decompose_claims(response: str) -> list[str]:
+     """Split response into atomic claim sentences (≥3 words each)."""
+     sentences = _SENTENCE_SPLIT.split(response.strip())
+     return [s.strip() for s in sentences if len(s.split()) >= 3]
+
+
+ def _context_sentences(chunks: list[str]) -> list[str]:
+     """Flatten context chunks into individual sentences for sentence-level NLI scoring.
+
+     Cross-encoder NLI degrades on multi-sentence inputs - performance is calibrated
+     on single-sentence (premise, hypothesis) pairs matching the SNLI/MNLI training format.
+     """
+     sentences = []
+     for chunk in chunks:
+         for s in _SENTENCE_SPLIT.split(chunk.strip()):
+             if len(s.split()) >= 3:
+                 sentences.append(s.strip())
+     return sentences
167
+
168
+
169
  def grade_faithfulness(response: str, context: str) -> GradeResult:
170
+ """Whole-response faithfulness: max entailment score across all context chunks."""
171
  if _is_refusal(response):
172
  return GradeResult(
173
  metric="faithfulness", passed=True, score=1.0,
 
178
  if not raw_chunks:
179
  return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
180
  chunks = [_strip_chunk_title(c) for c in raw_chunks]
181
+ sentences = _context_sentences(chunks)
182
+ pairs = [(s, response) for s in sentences]
183
+ scores_matrix: np.ndarray = model.predict(pairs, apply_softmax=True)
184
+ entailment: np.ndarray = scores_matrix[:, _NLI_ENTAILMENT_IDX]
185
+ log.info("NLI entailment scores: %s", [round(float(s), 3) for s in entailment])
186
+ score = float(entailment.max())
 
 
 
 
187
  return GradeResult(
188
  metric="faithfulness",
189
+ passed=score >= FAITHFULNESS_THRESHOLD,
190
  score=score,
191
  detail=f"Faithfulness {score:.3f} (threshold: {FAITHFULNESS_THRESHOLD})",
192
  )
193
 
194
 
195
+ def grade_faithfulness_decomposed(response: str, context: str) -> GradeResult:
196
+ """Claim-level faithfulness: each sentence verified independently against context.
197
+
198
+ Score = supported claims / total claims; catches partial hallucinations missed by whole-response NLI.
199
+ """
200
+ if _is_refusal(response):
201
+ return GradeResult(
202
+ metric="faithfulness", passed=True, score=1.0,
203
+ detail="Refusal - no factual claims to verify",
204
+ )
205
+ raw_chunks = [c.strip() for c in context.split("\n\n") if c.strip()]
206
+ if not raw_chunks:
207
+ return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
208
+
209
+ chunks = [_strip_chunk_title(c) for c in raw_chunks]
210
+ claims = decompose_claims(response)
211
+ if not claims:
212
+ return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No claims extracted")
213
+
214
+ sentences = _context_sentences(chunks)
215
+ model = get_nli_model()
216
+ claim_results: list[dict[str, Any]] = []
217
+
218
+ for claim in claims:
219
+ pairs = [(s, claim) for s in sentences]
220
+ scores_matrix: np.ndarray = model.predict(pairs, apply_softmax=True)
221
+ entailment: np.ndarray = scores_matrix[:, _NLI_ENTAILMENT_IDX]
222
+ best = float(entailment.max())
223
+ claim_results.append({"claim": claim, "score": round(best, 3), "supported": best >= FAITHFULNESS_THRESHOLD})
224
+
225
+ supported = sum(1 for c in claim_results if c["supported"])
226
+ score = supported / len(claim_results)
227
+ log.info("Claim decomposition: %d/%d supported (score=%.3f)", supported, len(claim_results), score)
228
+
229
+ return GradeResult(
230
+ metric="faithfulness",
231
+ passed=score >= FAITHFULNESS_THRESHOLD,
232
+ score=score,
233
+ detail=f"{supported}/{len(claim_results)} claims supported (threshold: {FAITHFULNESS_THRESHOLD})",
234
+ metadata={"claims": claim_results},
235
+ )
236
+
237
+
238
  def grade_chain_terminology(response: str, client: str) -> GradeResult:
239
  """Check that the response uses client-specific terms, not rival terminology."""
240
  result = check_terminology(response, client)
 
268
  grade_pii_leakage(response),
269
  grade_token_budget(response, token_budget),
270
  grade_answer_relevancy(query, response),
271
+ grade_faithfulness_decomposed(response, context),
272
  grade_chain_terminology(response, client),
273
  ]
274
  return report
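The claim-level scoring in `grade_faithfulness_decomposed()` can be illustrated without loading the NLI model. The sketch below reuses the sentence-split regex and the three-word filter from the diff, but swaps the cross-encoder for a hypothetical `toy_entailment()` word-overlap scorer so that it runs standalone. `toy_entailment`, `claim_level_score`, and the 0.5 threshold are assumptions made for illustration; they are not part of `grader.py`.

```python
import re

_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def decompose_claims(text: str) -> list[str]:
    """Split text into sentences, dropping fragments under three words."""
    sentences = _SENTENCE_SPLIT.split(text.strip())
    return [s.strip() for s in sentences if len(s.split()) >= 3]


def toy_entailment(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for the NLI model: share of hypothesis words found in the premise."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hypothesis_words = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_words:
        return 0.0
    hits = sum(1 for w in hypothesis_words if w in premise_words)
    return hits / len(hypothesis_words)


def claim_level_score(response: str, context_chunks: list[str], threshold: float = 0.5):
    """Return (supported / total, per-claim details); the threshold is an assumed value."""
    context = [s for chunk in context_chunks for s in decompose_claims(chunk)]
    claims = decompose_claims(response)
    if not claims or not context:
        return 0.0, []
    results = []
    for claim in claims:
        # Best-matching context sentence decides whether the claim counts as supported.
        best = max(toy_entailment(sentence, claim) for sentence in context)
        results.append({"claim": claim, "score": round(best, 3), "supported": best >= threshold})
    supported = sum(1 for r in results if r["supported"])
    return supported / len(results), results


# One grounded claim plus one invented claim gives a score of 0.5.
score, details = claim_level_score(
    "The product costs five dollars. It was invented in 1842.",
    ["The product costs five dollars.", "Delivery takes two days."],
)
```

The design point carries over to the real grader: a response with one grounded sentence and one invented sentence lands at 0.5 here, whereas a max-over-sentences score for the whole response could still look high.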
eval/compare_faithfulness.py ADDED
@@ -0,0 +1,132 @@
1
+ """
2
+ Side-by-side comparison: whole-response faithfulness vs claim-level decomposition.
3
+
4
+ Each golden-dataset pair is run through both graders using the full domain KB as context
5
+ (simulates retrieval returning all docs, putting maximum pressure on the NLI signal).
6
+
7
+ Output: aligned table with per-pair scores + delta, plus summary distributions.
8
+
9
+ Usage:
10
+ cd /Users/praca/ai-response-validator && .venv/bin/python eval/compare_faithfulness.py
11
+ """
12
+
13
+ import statistics
14
+ import sys
15
+ from pathlib import Path
16
+
17
+ import yaml
18
+
19
+ sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
20
+
21
+ from grader import (
22
+ FAITHFULNESS_THRESHOLD,
23
+ grade_faithfulness,
24
+ grade_faithfulness_decomposed,
25
+ )
26
+
27
+ DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
28
+ KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
29
+
30
+
31
+ def _load_pairs() -> list[dict]:
32
+ return yaml.safe_load(DATASET_PATH.read_text())["pairs"]
33
+
34
+
35
+ def _load_kb_context(domain: str) -> str:
36
+ path = KNOWLEDGE_ROOT / domain / "features.yaml"
37
+ data = yaml.safe_load(path.read_text())
38
+ chunks = [
39
+ f"[{doc['title']}]\n{doc['content'].strip()}"
40
+ for doc in data["documents"]
41
+ ]
42
+ return "\n\n".join(chunks)
43
+
44
+
45
+ def _fmt(score: float | None) -> str:
46
+ return f"{score:.3f}" if score is not None else " - "
47
+
48
+
49
+ def run() -> None:
50
+ pairs = _load_pairs()
51
+ kb: dict[str, str] = {}
52
+
53
+ print(f"\nFaithfulness comparison: {len(pairs)} golden-dataset pairs")
54
+ print("Context: full domain KB (all docs, simulating broad retrieval)\n")
55
+
56
+ header = f"{'id':<20} {'whole':>7} {'decomp':>7} {'delta':>7} {'claims':>6} {'sup/tot':>7} note"
57
+ print(header)
58
+ print("-" * len(header))
59
+
60
+ whole_scores: list[float] = []
61
+ decomp_scores: list[float] = []
62
+ deltas: list[float] = []
63
+ refusals: list[str] = []
64
+
65
+ for pair in pairs:
66
+ pid = pair["id"]
67
+ domain = pair["domain"]
68
+ response = pair["expected_answer"].strip()
69
+
70
+ if domain not in kb:
71
+ kb[domain] = _load_kb_context(domain)
72
+ context = kb[domain]
73
+
74
+ w = grade_faithfulness(response, context)
75
+ d = grade_faithfulness_decomposed(response, context)
76
+
77
+ if "Refusal" in w.detail:
78
+ refusals.append(pid)
79
+ print(f"{pid:<20} {'REFUSAL':>7} {'REFUSAL':>7} {'':>7} {'':>6} {'':>7}")
80
+ continue
81
+
82
+ whole_scores.append(w.score)
83
+ decomp_scores.append(d.score)
84
+ delta = d.score - w.score
85
+ deltas.append(delta)
86
+
87
+ claims_meta = d.metadata.get("claims", [])
88
+ n_claims = len(claims_meta)
89
+ n_supported = sum(1 for c in claims_meta if c["supported"])
90
+ sup_tot = f"{n_supported}/{n_claims}"
91
+
92
+ note = ""
93
+ if abs(delta) >= 0.15:
94
+ note = "<-- gap"
95
+
96
+ sign = "+" if delta >= 0 else ""
97
+ print(
98
+ f"{pid:<20} {w.score:>7.3f} {d.score:>7.3f} {sign}{delta:>6.3f} {n_claims:>6} {sup_tot:>7} {note}"
99
+ )
100
+
101
+ print("-" * len(header))
102
+ print()
103
+
104
+ if whole_scores:
105
+ print("Score distributions (refusals excluded):\n")
106
+ for name, scores in [("whole_response", whole_scores), ("decomposed", decomp_scores)]:
107
+ below = sum(1 for s in scores if s < FAITHFULNESS_THRESHOLD)
108
+ print(
109
+ f" {name:<16} "
110
+ f"min={min(scores):.3f} "
111
+ f"p25={sorted(scores)[len(scores)//4]:.3f} "
112
+ f"median={statistics.median(scores):.3f} "
113
+ f"p75={sorted(scores)[3*len(scores)//4]:.3f} "
114
+ f"max={max(scores):.3f} "
115
+ f"below_threshold={below}/{len(scores)}"
116
+ )
117
+
118
+ print()
119
+ neg_delta = sum(1 for d in deltas if d < -0.05)
120
+ mean_abs = statistics.mean(abs(d) for d in deltas)
121
+ print(f" mean |delta| : {mean_abs:.3f}")
122
+ print(f" decomp < whole : {neg_delta}/{len(deltas)} pairs (whole-response was optimistic here)")
123
+ print(f" threshold : {FAITHFULNESS_THRESHOLD}")
124
+
125
+ if refusals:
126
+ print(f"\n Refusals (auto-pass, excluded from stats): {', '.join(refusals)}")
127
+
128
+ print()
129
+
130
+
131
+ if __name__ == "__main__":
132
+ run()
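The summary loop in `run()` reads quartiles straight from the sorted score list by index. A minimal sketch of that computation as a standalone helper follows; `summarize`, the sample scores, and the 0.35 threshold passed in the example are illustrative assumptions, not values taken from the script's output.

```python
import statistics


def summarize(scores: list[float], threshold: float) -> dict[str, float]:
    """Index-based quartiles on the sorted scores, mirroring the script's summary loop."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    n = len(ordered)
    return {
        "min": ordered[0],
        "p25": ordered[n // 4],          # lower quartile by position, no interpolation
        "median": statistics.median(ordered),
        "p75": ordered[3 * n // 4],      # upper quartile by position, no interpolation
        "max": ordered[-1],
        "below_threshold": sum(1 for s in ordered if s < threshold),
    }


summary = summarize([0.2, 0.5, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0], threshold=0.35)
```

Index-based quartiles are cheap and fine for a console report. If interpolated quartiles were wanted instead, `statistics.quantiles(scores, n=4)` from the standard library would be one option.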
eval/drift.py ADDED
@@ -0,0 +1,200 @@
1
+ """
2
+ Drift detection: compare live grader score distributions against the golden-dataset baseline.
3
+
4
+ Answers: has answer quality shifted since the reference was established?
5
+ Catches: model updates, KB staleness, query distribution shift, threshold miscalibration.
6
+
7
+ Statistical test: KS two-sample (same as Evidently DataDriftPreset for numerical columns).
8
+ - H0: current and reference are drawn from the same distribution
9
+ - H1: distributions differ
10
+ - Drifted if p_value < alpha (default 0.05)
11
+
12
+ Reference: golden-dataset expected_answer scores (known-good baseline).
13
+ Current: in-memory telemetry._events from the running API session.
14
+
15
+ Usage:
16
+ cd /Users/praca/ai-response-validator && .venv/bin/python eval/drift.py
17
+ """
18
+
19
+ import sys
20
+ from dataclasses import dataclass
21
+ from pathlib import Path
22
+
23
+ import yaml
24
+ from scipy.stats import ks_2samp
25
+
26
+ sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
27
+
28
+ from grader import (
29
+ grade_answer_relevancy,
30
+ grade_chain_terminology,
31
+ grade_faithfulness_decomposed,
32
+ grade_pii_leakage,
33
+ grade_token_budget,
34
+ )
35
+
36
+ DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
37
+ KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
38
+
39
+ METRICS = ["faithfulness", "answer_relevancy", "pii_leakage", "token_budget", "chain_terminology"]
40
+ ALPHA = 0.05
41
+ MIN_CURRENT_SAMPLES = 5
42
+
43
+
44
+ @dataclass(slots=True)
45
+ class MetricDrift:
46
+ metric: str
47
+ ks_statistic: float
48
+ p_value: float
49
+ drifted: bool
50
+ ref_mean: float
51
+ cur_mean: float
52
+ ref_n: int
53
+ cur_n: int
54
+
55
+
56
+ def _load_kb_context(domain: str) -> str:
57
+ path = KNOWLEDGE_ROOT / domain / "features.yaml"
58
+ data = yaml.safe_load(path.read_text())
59
+ chunks = [f"[{doc['title']}]\n{doc['content'].strip()}" for doc in data["documents"]]
60
+ return "\n\n".join(chunks)
61
+
62
+
63
+ Scores = dict[str, list[float]]
64
+
65
+
66
+ def build_reference() -> Scores:
67
+ """Score every golden-dataset pair with all graders."""
68
+ pairs = yaml.safe_load(DATASET_PATH.read_text())["pairs"]
69
+ kb: dict[str, str] = {}
70
+ scores: Scores = {m: [] for m in METRICS}
71
+
72
+ for pair in pairs:
73
+ response = pair["expected_answer"].strip()
74
+ domain = pair["domain"]
75
+ if domain not in kb:
76
+ kb[domain] = _load_kb_context(domain)
77
+
78
+ scores["pii_leakage"].append(grade_pii_leakage(response).score)
79
+ scores["token_budget"].append(grade_token_budget(response).score)
80
+ scores["answer_relevancy"].append(grade_answer_relevancy(pair["question"], response).score)
81
+ scores["faithfulness"].append(grade_faithfulness_decomposed(response, kb[domain]).score)
82
+ scores["chain_terminology"].append(grade_chain_terminology(response, pair["client"]).score)
83
+
84
+ return scores
85
+
86
+
87
+ def build_current() -> Scores:
88
+ """Pull metric scores from the in-memory telemetry buffer."""
89
+ import telemetry
90
+
91
+ with telemetry._lock:
92
+ events = list(telemetry._events)
93
+
94
+ scores: Scores = {m: [] for m in METRICS}
95
+ for event in events:
96
+ if "metrics" not in event:
97
+ continue
98
+ if any(event["metrics"].get(m) is None for m in METRICS):
99
+ continue
100
+ for m in METRICS:
101
+ scores[m].append(float(event["metrics"][m]))
102
+
103
+ return scores
104
+
105
+
106
+ def detect_drift(
107
+ current: Scores,
108
+ reference: Scores,
109
+ alpha: float = ALPHA,
110
+ ) -> list[MetricDrift]:
111
+ """Run KS two-sample test per metric. Skips metrics with fewer than MIN_CURRENT_SAMPLES."""
112
+ results: list[MetricDrift] = []
113
+
114
+ for metric in METRICS:
115
+ ref_col = reference.get(metric, [])
116
+ cur_col = current.get(metric, [])
117
+
118
+ if len(cur_col) < MIN_CURRENT_SAMPLES or len(ref_col) == 0:
119
+ continue
120
+
121
+ import numpy as np
122
+ ref_arr = np.array(ref_col, dtype=float)
123
+ cur_arr = np.array(cur_col, dtype=float)
124
+
125
+ stat, pval = ks_2samp(ref_arr, cur_arr)
126
+ results.append(MetricDrift(
127
+ metric=metric,
128
+ ks_statistic=round(float(stat), 4),
129
+ p_value=round(float(pval), 4),
130
+ drifted=bool(pval < alpha),
131
+ ref_mean=round(float(ref_arr.mean()), 4),
132
+ cur_mean=round(float(cur_arr.mean()), 4),
133
+ ref_n=len(ref_arr),
134
+ cur_n=len(cur_arr),
135
+ ))
136
+
137
+ return results
138
+
139
+
140
+ def report_drift(results: list[MetricDrift], alpha: float = ALPHA) -> None:
141
+ header = (
142
+ f"{'metric':<22} {'ks_stat':>7} {'p_value':>7} {'status':>10}"
143
+ f" {'ref_mean':>8} {'cur_mean':>8} {'delta':>7}"
144
+ )
145
+ print(header)
146
+ print("-" * len(header))
147
+
148
+ for r in results:
149
+ status = "DRIFT <--" if r.drifted else "ok"
150
+ delta = r.cur_mean - r.ref_mean
151
+ sign = "+" if delta >= 0 else ""
152
+ print(
153
+ f"{r.metric:<22} {r.ks_statistic:>7.4f} {r.p_value:>7.4f} {status:>10}"
154
+ f" {r.ref_mean:>8.4f} {r.cur_mean:>8.4f} {sign}{delta:>6.4f}"
155
+ )
156
+
157
+ drifted = [r for r in results if r.drifted]
158
+ print(f"\nOverall: {len(drifted)}/{len(results)} metrics drifted (alpha={alpha})")
159
+
160
+ if drifted:
161
+ print("\nDrifted metrics:")
162
+ for r in drifted:
163
+ direction = "degraded" if r.cur_mean < r.ref_mean else "improved"
164
+ print(f" {r.metric}: {direction} ({r.ref_mean:.3f} → {r.cur_mean:.3f})")
165
+
166
+
167
+ def run() -> None:
168
+ print("\nBuilding reference distribution from golden-dataset.yaml...")
169
+ reference = build_reference()
170
+ ref_n = len(next(iter(reference.values()), []))
171
+ print(f"Reference: {ref_n} pairs\n")
172
+
173
+ current = build_current()
174
+
175
+ cur_n = len(next(iter(current.values()), []))
176
+ if cur_n < MIN_CURRENT_SAMPLES:
177
+ import numpy as np
178
+ print(
179
+ f"Current: {cur_n} telemetry event(s); need ≥{MIN_CURRENT_SAMPLES} to run KS test.\n"
180
+ f"Start the API and run some queries, then re-run drift.py.\n\n"
181
+ f"Reference distribution (golden baseline):\n"
182
+ )
183
+ for m in METRICS:
184
+ vals = np.array(reference[m])
185
+ print(f" {m:<22} mean={vals.mean():.3f} std={vals.std():.3f} min={vals.min():.3f} max={vals.max():.3f}")
186
+ return
187
+
188
+ print(f"Current: {cur_n} telemetry events\n")
189
+ results = detect_drift(current, reference)
190
+
191
+ if not results:
192
+ print("No metrics had enough data for KS test.\n")
193
+ return
194
+
195
+ report_drift(results)
196
+ print()
197
+
198
+
199
+ if __name__ == "__main__":
200
+ run()
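`detect_drift()` delegates the statistics to `scipy.stats.ks_2samp`. To show what that test measures, here is a pure-Python sketch of the two-sample KS statistic (the largest gap between the two empirical CDFs) with a textbook asymptotic p-value approximation. The helper names `ks_statistic`, `ks_pvalue_asymptotic`, and `is_drifted` are hypothetical, and the p-value is only an approximation, so it can differ from what SciPy reports, particularly on small samples.

```python
import math
from bisect import bisect_right


def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Largest absolute difference between the two empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    n, m = len(ref_sorted), len(cur_sorted)
    d = 0.0
    for x in set(ref_sorted) | set(cur_sorted):
        # bisect_right counts values <= x, i.e. the empirical CDF numerator.
        f_ref = bisect_right(ref_sorted, x) / n
        f_cur = bisect_right(cur_sorted, x) / m
        d = max(d, abs(f_ref - f_cur))
    return d


def ks_pvalue_asymptotic(d: float, n: int, m: int, terms: int = 100) -> float:
    """Approximate p-value from the Kolmogorov series (an approximation, not SciPy's method)."""
    if d == 0.0:
        return 1.0
    effective_n = n * m / (n + m)
    lam = (math.sqrt(effective_n) + 0.12 + 0.11 / math.sqrt(effective_n)) * d
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def is_drifted(reference: list[float], current: list[float], alpha: float = 0.05) -> bool:
    """Reject H0 (same distribution) when the approximate p-value falls below alpha."""
    d = ks_statistic(reference, current)
    return ks_pvalue_asymptotic(d, len(reference), len(current)) < alpha


shifted = is_drifted([0.9] * 50, [0.1] * 50)
same = is_drifted([0.1, 0.2, 0.3, 0.4, 0.5], [0.1, 0.2, 0.3, 0.4, 0.5])
```

Because the statistic depends only on the CDF gap, binary metrics such as `pii_leakage` produce heavily tied samples; the KS test still runs on them, but the result is coarse.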
eval/simulate_traffic.py ADDED
@@ -0,0 +1,190 @@
1
+ """
2
+ Populate telemetry with simulated traffic, then run drift detection.
3
+
4
+ Two batches:
5
+ clean: golden-dataset expected_answers (should match reference distribution)
6
+ dirty: same questions, hallucinated responses (should show faithfulness drift)
7
+
8
+ Bypasses the API entirely: runs graders + telemetry.record() directly.
9
+
10
+ Usage:
11
+ cd /Users/praca/ai-response-validator && .venv/bin/python eval/simulate_traffic.py
12
+ """
13
+
14
+ import sys
15
+ import time
16
+ from pathlib import Path
17
+
18
+ import yaml
19
+
20
+ sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
21
+
22
+ import telemetry
23
+ from config import CLIENT_DOMAIN
24
+ from grader import GradeReport, grade
25
+
26
+ DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
27
+ KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
28
+
29
+ # Hallucinated responses: plausible-sounding, but they contradict KB facts
30
+ HALLUCINATED: dict[str, str] = {
31
+ # retail: NovaMart
32
+ "retail-nm-001": (
33
+ "When a product runs out of stock, the system automatically places a reorder after 72 hours "
34
+ "with no alerts sent to any manager. The supplier is notified only at month-end review."
35
+ ),
36
+ "retail-nm-002": (
37
+ "To add a new supplier, send an email to the procurement team with the company name. "
38
+ "No tax ID or payment terms are required at this stage. "
39
+ "Purchase orders can be created immediately without waiting for validation."
40
+ ),
41
+ "retail-nm-003": (
42
+ "Feature flags are permanent once enabled and cannot be disabled without a code deployment. "
43
+ "There is no expiry date or activation scope. Any employee can enable a flag in production."
44
+ ),
45
+ "retail-nm-004": (
46
+ "The authoritative source for product information is the pricing portal. "
47
+ "SKU records are updated manually once per week by the merchandising team. "
48
+ "Archived products can be reactivated instantly by any store manager."
49
+ ),
50
+ "retail-nm-005": (
51
+ "Price changes take effect immediately upon submission with no approval required. "
52
+ "There is no sync window; prices update in real time. "
53
+ "Emergency corrections are handled automatically without escalation."
54
+ ),
55
+ # retail: ShelfWise
56
+ "retail-sw-001": (
57
+ "An out-of-stock alert fires only after a manual stock check is initiated by a store manager. "
58
+ "The alert is sent exclusively to the regional director. "
59
+ "No escalation occurs if the alert is unacknowledged."
60
+ ),
61
+ "retail-sw-002": (
62
+ "Feature toggles are permanent once enabled. "
63
+ "There is no activation scope and no expiry date requirement. "
64
+ "Any user can enable toggles in production without sign-off."
65
+ ),
66
+ "retail-sw-004": (
67
+ "Compliance reports are editable for up to 30 days after creation and are stored for 2 years. "
68
+ "Any user can access compliance reports from the standard dashboard. "
69
+ "Reports are generated on demand only."
70
+ ),
71
+ "retail-sw-005": (
72
+ "Product catalog updates require manual approval for each SKU and can take up to 48 hours. "
73
+ "Deactivated products are permanently deleted and cannot be recovered."
74
+ ),
75
+ # pharma: ClinixOne
76
+ "pharma-cx-001": (
77
+ "Prior authorization is optional and payers respond within 7 business days. "
78
+ "Denied requests cannot be appealed and the prescriber must choose an alternative drug."
79
+ ),
80
+ "pharma-cx-003": (
81
+ "Adverse events must be reported to regulators within 30 days for all event types. "
82
+ "A safety signal is raised automatically by the system when 3 or more events occur. "
83
+ "Expected events do not require regulatory reporting."
84
+ ),
85
+ "pharma-cx-004": (
86
+ "Clinical trials have two phases: Phase I for safety and Phase II for market approval. "
87
+ "Enrollment eligibility is determined by the treating physician with no formal criteria."
88
+ ),
89
+ # pharma: PharmaLink
90
+ "pharma-pl-001": (
91
+ "Formulary pre-approval is automatically granted for all branded drugs. "
92
+ "The payer responds within 30 days and denied requests cannot be appealed."
93
+ ),
94
+ "pharma-pl-003": (
95
+ "The formulary has two tiers: generic and branded. "
96
+ "Moving a drug to a higher tier requires a 7-day notice to prescribers. "
97
+ "Tier assignment is reviewed every 5 years."
98
+ ),
99
+ "pharma-pl-004": (
100
+ "A prescribing pathway is a marketing document produced by pharmaceutical companies. "
101
+ "Pathways are reviewed every 5 years and payers do not use them in coverage decisions. "
102
+ "Deviation from a pathway requires no documentation."
103
+ ),
104
+ "pharma-pl-005": (
105
+ "Enrollment authorization is a formality; patients sign a standard waiver. "
106
+ "Consent is obtained after the first study procedure, not before. "
107
+ "Protocol changes do not require re-consent from existing participants."
108
+ ),
109
+ }
110
+
111
+
112
+ def _load_kb_context(domain: str) -> str:
113
+ path = KNOWLEDGE_ROOT / domain / "features.yaml"
114
+ data = yaml.safe_load(path.read_text())
115
+ chunks = [f"[{doc['title']}]\n{doc['content'].strip()}" for doc in data["documents"]]
116
+ return "\n\n".join(chunks)
117
+
118
+
119
+ def _record(pair: dict, response: str, context: str, tag: str) -> GradeReport:
120
+ client = pair["client"]
121
+ report = grade(
122
+ query=pair["question"],
123
+ response=response,
124
+ context=context,
125
+ client=client,
126
+ )
127
+ telemetry.record(
128
+ client=client,
129
+ domain=pair["domain"],
130
+ query_len=len(pair["question"].split()),
131
+ latency_ms={"retrieve": 12.0, "generate": 180.0, "grade": 45.0},
132
+ report=report,
133
+ docs_retrieved=3,
134
+ min_retrieval_score=0.72,
135
+ )
136
+ status = "PASS" if report.overall else "FAIL"
137
+ faith = next(r for r in report.results if r.metric == "faithfulness")
138
+ print(f" [{tag}] {pair['id']:<20} {status} faith={faith.score:.3f} {faith.detail}")
139
+ return report
140
+
141
+
142
+ def run() -> None:
143
+ pairs = yaml.safe_load(DATASET_PATH.read_text())["pairs"]
144
+ kb: dict[str, str] = {}
145
+
146
+ # ── Batch 1: clean traffic ──────────────────────────────────────────────
147
+ print("\n── Batch 1: clean traffic (expected answers) ──\n")
148
+ for pair in pairs:
149
+ domain = pair["domain"]
150
+ if domain not in kb:
151
+ kb[domain] = _load_kb_context(domain)
152
+ response = pair["expected_answer"].strip()
153
+ _record(pair, response, kb[domain], "clean")
154
+ time.sleep(0.05)
155
+
156
+ # ── Batch 2: dirty traffic (hallucinated responses) ─────────────────────
157
+ print("\n── Batch 2: dirty traffic (hallucinated responses) ──\n")
158
+ dirty_pairs = [p for p in pairs if p["id"] in HALLUCINATED]
159
+ for pair in dirty_pairs:
160
+ domain = pair["domain"]
161
+ response = HALLUCINATED[pair["id"]]
162
+ _record(pair, response, kb[domain], "dirty")
163
+ time.sleep(0.05)
164
+
165
+ total = telemetry.live_stats()["total_queries"]
166
+ print(f"\nTelemetry buffer: {total} events ({len(pairs)} clean + {len(dirty_pairs)} dirty)\n")
167
+
168
+ # ── Drift detection ─────────────────────────────────────────────────────
169
+ print("=" * 60)
170
+ print("Running drift detection vs golden-dataset baseline...")
171
+ print("=" * 60)
172
+
173
+ sys.path.insert(0, str(Path(__file__).parent))
174
+ from drift import build_current, build_reference, detect_drift, report_drift
175
+
176
+ print("\nBuilding reference distribution...")
177
+ reference = build_reference()
178
+
179
+ current = build_current()
180
+ cur_n = len(next(iter(current.values()), []))
181
+ print(f"Reference: {len(next(iter(reference.values())))} pairs")
182
+ print(f"Current: {cur_n} events\n")
183
+
184
+ results = detect_drift(current, reference)
185
+ report_drift(results)
186
+ print()
187
+
188
+
189
+ if __name__ == "__main__":
190
+ run()
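The simulation writes events through `telemetry.record()`, and `build_current()` later reads `telemetry._events` under `telemetry._lock`, skipping events that lack any of the five metrics. The telemetry module itself is not part of this diff, so the sketch below is a hypothetical stand-in: `TelemetryBuffer` and `collect_scores` are invented names, and only the filtering rule is taken from `build_current()`.

```python
import threading
from collections import deque

METRICS = ["faithfulness", "answer_relevancy", "pii_leakage", "token_budget", "chain_terminology"]


class TelemetryBuffer:
    """Hypothetical bounded, thread-safe event buffer (not the project's telemetry module)."""

    def __init__(self, maxlen: int = 1000) -> None:
        self._lock = threading.Lock()
        self._events = deque(maxlen=maxlen)  # oldest events are dropped once full

    def record(self, **event) -> None:
        with self._lock:
            self._events.append(event)

    def snapshot(self) -> list:
        # Copy under the lock so readers never iterate a buffer that is being mutated.
        with self._lock:
            return list(self._events)


def collect_scores(events: list) -> dict:
    """Keep only events that carry every metric, as build_current() does."""
    scores = {m: [] for m in METRICS}
    for event in events:
        metrics = event.get("metrics")
        if metrics is None or any(metrics.get(m) is None for m in METRICS):
            continue
        for m in METRICS:
            scores[m].append(float(metrics[m]))
    return scores


buffer = TelemetryBuffer()
buffer.record(client="demo", metrics={m: 1.0 for m in METRICS})
buffer.record(client="demo", metrics={"faithfulness": 0.2})  # incomplete: skipped
buffer.record(client="demo")  # no metrics at all: skipped
scores = collect_scores(buffer.snapshot())
```

Dropping incomplete events keeps every metric column the same length, which is what lets `run()` read the sample count from the first column alone.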
tests/unit/test_drift.py ADDED
@@ -0,0 +1,116 @@
1
+ """
2
+ Unit tests for drift detection: detect_drift() only.
3
+ No model loading, no IO, no telemetry.
4
+ """
5
+
6
+ import sys
7
+ from pathlib import Path
8
+
9
+ import numpy as np
10
+ import pytest
11
+
12
+ sys.path.insert(0, str(Path(__file__).parent.parent.parent / "eval"))
13
+
14
+ from drift import ALPHA, MIN_CURRENT_SAMPLES, MetricDrift, detect_drift
15
+
16
+ METRICS = ["faithfulness", "answer_relevancy", "pii_leakage", "token_budget", "chain_terminology"]
17
+
18
+
19
+ def _scores(n: int, **col_values: list[float]) -> dict[str, list[float]]:
20
+ """Build a Scores dict with fixed values per column; defaults to 0.9 for others."""
21
+ data: dict[str, list[float]] = {}
22
+ for metric in METRICS:
23
+ data[metric] = col_values.get(metric, [0.9] * n)
24
+ return data
25
+
26
+
27
+ class TestDetectDrift:
28
+ def test_identical_distributions_no_drift(self) -> None:
29
+ rng = np.random.default_rng(42)
30
+ scores = rng.uniform(0.5, 1.0, 50).tolist()
31
+ ref = _scores(50, faithfulness=scores)
32
+ cur = _scores(50, faithfulness=scores)
33
+ results = detect_drift(cur, ref)
34
+ faith = next(r for r in results if r.metric == "faithfulness")
35
+ assert faith.drifted is False
36
+
37
+ def test_shifted_distribution_detected(self) -> None:
38
+ ref = _scores(50, faithfulness=[0.9] * 50)
39
+ cur = _scores(50, faithfulness=[0.1] * 50)
40
+ results = detect_drift(cur, ref)
41
+ faith = next(r for r in results if r.metric == "faithfulness")
42
+ assert faith.drifted is True
43
+ assert faith.p_value < ALPHA
44
+
45
+ def test_below_min_samples_excluded(self) -> None:
46
+ ref = _scores(50)
47
+ cur = _scores(MIN_CURRENT_SAMPLES - 1)
48
+ results = detect_drift(cur, ref)
49
+ assert results == []
50
+
51
+ def test_exactly_min_samples_included(self) -> None:
52
+ ref = _scores(50)
53
+ cur = _scores(MIN_CURRENT_SAMPLES)
54
+ results = detect_drift(cur, ref)
55
+ assert len(results) == len(METRICS)
56
+
57
+ def test_ks_statistic_in_range(self) -> None:
58
+ ref = _scores(50, faithfulness=[0.9] * 50)
59
+ cur = _scores(50, faithfulness=[0.1] * 50)
60
+ results = detect_drift(cur, ref)
61
+ faith = next(r for r in results if r.metric == "faithfulness")
62
+ assert 0.0 <= faith.ks_statistic <= 1.0
63
+
64
+ def test_means_computed_correctly(self) -> None:
65
+ ref = _scores(10, faithfulness=[0.8] * 10)
66
+ cur = _scores(10, faithfulness=[0.4] * 10)
67
+ results = detect_drift(cur, ref)
68
+ faith = next(r for r in results if r.metric == "faithfulness")
69
+ assert faith.ref_mean == pytest.approx(0.8, abs=1e-3)
70
+ assert faith.cur_mean == pytest.approx(0.4, abs=1e-3)
71
+
72
+ def test_all_metrics_returned(self) -> None:
73
+ ref = _scores(30)
74
+ cur = _scores(30)
75
+ result_names = {r.metric for r in detect_drift(cur, ref)}
76
+ assert result_names == set(METRICS)
77
+
78
+ def test_result_is_metric_drift_dataclass(self) -> None:
79
+ ref = _scores(20)
80
+ cur = _scores(20)
81
+ for r in detect_drift(cur, ref):
82
+ assert isinstance(r, MetricDrift)
83
+ assert isinstance(r.drifted, bool)
84
+ assert isinstance(r.ks_statistic, float)
85
+ assert isinstance(r.p_value, float)
86
+
87
+ def test_custom_alpha_respected(self) -> None:
88
+ rng = np.random.default_rng(0)
89
+ ref = _scores(50, faithfulness=rng.uniform(0.7, 1.0, 50).tolist())
90
+ cur = _scores(50, faithfulness=rng.uniform(0.4, 0.7, 50).tolist())
91
+ strict = detect_drift(cur, ref, alpha=0.001)
92
+ lenient = detect_drift(cur, ref, alpha=0.999)
93
+ faith_strict = next(r for r in strict if r.metric == "faithfulness")
94
+ faith_lenient = next(r for r in lenient if r.metric == "faithfulness")
95
+ assert faith_lenient.drifted or not faith_strict.drifted
96
+
97
+ def test_missing_metric_column_skipped(self) -> None:
98
+ ref: dict[str, list[float]] = {"faithfulness": [0.9] * 20}
99
+ cur: dict[str, list[float]] = {"faithfulness": [0.4] * 20}
100
+ results = detect_drift(cur, ref)
101
+ assert all(r.metric == "faithfulness" for r in results)
102
+ assert len(results) == 1
103
+
104
+ def test_empty_reference_skipped(self) -> None:
105
+ ref: dict[str, list[float]] = {"faithfulness": []}
106
+ cur: dict[str, list[float]] = {"faithfulness": [0.4] * 20}
107
+ results = detect_drift(cur, ref)
108
+ assert results == []
109
+
110
+ def test_sample_counts_in_result(self) -> None:
111
+ ref = _scores(30)
112
+ cur = _scores(10)
113
+ results = detect_drift(cur, ref)
114
+ for r in results:
115
+ assert r.ref_n == 30
116
+ assert r.cur_n == 10
tests/unit/test_grader.py CHANGED
@@ -11,10 +11,17 @@ import pytest
11
 
12
  sys.path.insert(0, str(Path(__file__).parent.parent.parent / "backend"))
13
 
 
 
 
 
14
  from grader import (
15
  grade_pii_leakage,
16
  grade_token_budget,
17
  grade_chain_terminology,
 
 
 
18
  TOKEN_BUDGET,
19
  )
20
 
@@ -138,3 +145,111 @@ class TestChainTerminology:
138
  )
139
  assert result.passed is False
140
  assert any(v["expected"] == "formulary pre-approval" for v in result.metadata["violations"])
11
 
12
  sys.path.insert(0, str(Path(__file__).parent.parent.parent / "backend"))
13
 
14
+ from unittest.mock import MagicMock, patch
15
+
16
+ import numpy as np
17
+
18
  from grader import (
19
  grade_pii_leakage,
20
  grade_token_budget,
21
  grade_chain_terminology,
22
+ decompose_claims,
23
+ grade_faithfulness_decomposed,
24
+ FAITHFULNESS_THRESHOLD,
25
  TOKEN_BUDGET,
26
  )
27
 
 
145
  )
146
  assert result.passed is False
147
  assert any(v["expected"] == "formulary pre-approval" for v in result.metadata["violations"])
148
+
149
+
150
+ # ── decompose_claims ──────────────────────────────────────────────────────────
151
+
152
+ class TestDecomposeClaims:
153
+ def test_single_sentence(self) -> None:
154
+ claims = decompose_claims("The product is in stock.")
155
+ assert claims == ["The product is in stock."]
156
+
157
+ def test_multi_sentence_split(self) -> None:
158
+ claims = decompose_claims("The product is in stock. It costs five dollars. Delivery takes two days.")
159
+ assert len(claims) == 3
160
+
161
+ def test_fragments_under_three_words_excluded(self) -> None:
162
+ claims = decompose_claims("Yes. The product is available in all sizes.")
163
+ assert all(len(c.split()) >= 3 for c in claims)
164
+
165
+ def test_exclamation_and_question_split(self) -> None:
166
+ claims = decompose_claims("Stock is low! Would you like to reorder? The threshold is five units.")
167
+ assert len(claims) == 3
168
+
169
+ def test_empty_string_returns_empty(self) -> None:
170
+ assert decompose_claims("") == []
171
+
172
+
173
+ # ── grade_faithfulness_decomposed ────────────────────────────────────────────
174
+
175
+ def _make_nli(entailment: float) -> MagicMock:
176
+ """Mock CrossEncoder whose predict() always returns the given entailment score."""
177
+ mock = MagicMock()
178
+ # columns: [contradiction, entailment, neutral]
179
+ mock.predict = MagicMock(
180
+ side_effect=lambda pairs, **kw: np.array([[0.1, entailment, 0.0]] * len(pairs))
181
+ )
182
+ return mock
183
+
184
+
185
+ CONTEXT = "The product costs five dollars.\n\nDelivery takes two days."
186
+
187
+
188
+ class TestGradeFaithfulnessDecomposed:
189
+ def test_all_claims_supported_passes(self) -> None:
190
+ with patch("grader.get_nli_model", return_value=_make_nli(0.9)):
191
+ result = grade_faithfulness_decomposed(
192
+ "The product costs five dollars. Delivery takes two days.", CONTEXT
193
+ )
194
+ assert result.passed is True
195
+ assert result.score == 1.0
196
+ assert result.metadata["claims"][0]["supported"] is True
197
+
198
+ def test_all_claims_unsupported_fails(self) -> None:
199
+ with patch("grader.get_nli_model", return_value=_make_nli(0.1)):
200
+ result = grade_faithfulness_decomposed(
201
+ "The product costs five dollars. Delivery takes two days.", CONTEXT
202
+ )
203
+ assert result.passed is False
204
+ assert result.score == 0.0
205
+
206
+ def test_partial_hallucination_detected(self) -> None:
207
+ # first claim supported, second not; whole-response NLI would miss this
208
+ call_count = 0
209
+
210
+ def side_effect(pairs: list, **kw: object) -> np.ndarray:
211
+ nonlocal call_count
212
+ call_count += 1
213
+ entailment = 0.9 if call_count == 1 else 0.1
214
+ return np.array([[0.1, entailment, 0.0]] * len(pairs))
215
+
216
+ mock_model = MagicMock()
217
+ mock_model.predict = MagicMock(side_effect=side_effect)
218
+ with patch("grader.get_nli_model", return_value=mock_model):
219
+ result = grade_faithfulness_decomposed(
220
+ "The product costs five dollars. It was invented in 1842.", CONTEXT
221
+ )
222
+ assert result.score == 0.5
223
+ assert result.metadata["claims"][0]["supported"] is True
224
+ assert result.metadata["claims"][1]["supported"] is False
225
+
226
+ def test_refusal_auto_passes(self) -> None:
227
+ result = grade_faithfulness_decomposed(
228
+ "I don't have enough information to answer that.", CONTEXT
229
+ )
230
+ assert result.passed is True
231
+ assert result.score == 1.0
232
+
233
+ def test_empty_context_fails(self) -> None:
234
+ with patch("grader.get_nli_model"):
235
+ result = grade_faithfulness_decomposed("The product costs five dollars.", "")
236
+ assert result.passed is False
237
+ assert result.score == 0.0
238
+
239
+ def test_metadata_shape(self) -> None:
240
+ with patch("grader.get_nli_model", return_value=_make_nli(0.8)):
241
+ result = grade_faithfulness_decomposed(
242
+ "The product is available. It ships in two days.", CONTEXT
243
+ )
244
+ for entry in result.metadata["claims"]:
245
+ assert "claim" in entry
246
+ assert "score" in entry
247
+ assert "supported" in entry
248
+
249
+ def test_score_is_proportion_not_max(self) -> None:
250
+ """Verify score = supported/total, not max(entailment_scores)."""
251
+ with patch("grader.get_nli_model", return_value=_make_nli(0.9)):
252
+ result = grade_faithfulness_decomposed(
253
+ "Claim one is true. Claim two is also true. Claim three too.", CONTEXT
254
+ )
255
+ assert result.score == 1.0
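`test_score_is_proportion_not_max` feeds the same entailment value to every claim, so a proportion-based score and a max-based score both come out at 1.0. A sharper check would use per-claim scores where the two disagree. The sketch below shows the idea with a self-contained stand-in for the claim loop: `proportion_supported` and `make_model` are hypothetical helpers, the 0.35 threshold is an assumed value, and the mock returns plain lists rather than NumPy arrays.

```python
from unittest.mock import MagicMock


def proportion_supported(claims, context_sentences, model, threshold):
    """Stand-in for the claim loop: one predict() call per claim, best score vs threshold."""
    supported = 0
    best_scores = []
    for claim in claims:
        rows = model.predict([(sentence, claim) for sentence in context_sentences])
        best = max(row[1] for row in rows)  # column 1 holds entailment in this mock
        best_scores.append(best)
        if best >= threshold:
            supported += 1
    return supported / len(claims), best_scores


def make_model(per_claim_entailment):
    """Mock whose n-th predict() call returns the n-th entailment value for every pair."""
    values = iter(per_claim_entailment)

    def predict(pairs):
        entailment = next(values)
        return [[0.1, entailment, 0.0] for _ in pairs]

    model = MagicMock()
    model.predict.side_effect = predict
    return model


# One supported claim out of three: proportion is 1/3 while the max score is 0.9.
model = make_model([0.9, 0.1, 0.1])
score, best = proportion_supported(
    ["claim a", "claim b", "claim c"], ["context one", "context two"], model, threshold=0.35
)
```

With these inputs a max-based implementation would report 0.9 and a proportion-based one 1/3, so an assertion on the score separates the two behaviours.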