Replace HHEM with sentence-level NLI, add claim decomposition and drift detection
Faithfulness grader:
- Replace broken Vectara HHEM v2 (zero embedding matrix) with cross-encoder/nli-deberta-v3-small
- Add decompose_claims(): splits the response into atomic sentences for per-claim verification
- Add _context_sentences(): splits KB chunks into sentences before NLI scoring; fixes
paragraph-level entailment collapse (verbatim text was scoring ent=0.002 at paragraph level,
ent=0.995 at sentence level, including aliased terms like "item registry" vs "product catalog")
- grade_faithfulness_decomposed() promoted to default in grade(); score = supported/total claims
Drift detection:
- eval/drift.py: KS two-sample test per metric vs golden-dataset baseline
- eval/compare_faithfulness.py: side-by-side whole-response vs claim-level scores
- eval/simulate_traffic.py: clean + hallucinated traffic simulation for drift testing
- tests/unit/test_drift.py: 12 unit tests for detect_drift()
Docs updated to reflect all changes (ARCHITECTURE.md, NOTES.md, README.md)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ARCHITECTURE.md +69 -32
- NOTES.md +38 -6
- README.md +15 -4
- backend/grader.py +84 -42
- eval/compare_faithfulness.py +132 -0
- eval/drift.py +200 -0
- eval/simulate_traffic.py +190 -0
- tests/unit/test_drift.py +116 -0
- tests/unit/test_grader.py +115 -0
|
@@ -60,7 +60,7 @@ Runs inline with every request. No ground truth required.
|
|
| 60 |
| `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate: fails hard |
|
| 61 |
| `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
|
| 62 |
| `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
|
| 63 |
-
| `faithfulness` |
|
| 64 |
| `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
|
| 65 |
|
| 66 |
### L2 - Batch (local, against golden dataset)
|
|
@@ -68,10 +68,14 @@ Runs inline with every request. No ground truth required.
|
|
| 68 |
```bash
|
| 69 |
python eval/metrics.py --domain retail
|
| 70 |
python eval/metrics.py --client novamart --out results.json
|
|
|
|
|
|
|
|
|
|
| 71 |
```
|
| 72 |
|
| 73 |
-
Runs
|
| 74 |
-
|
|
|
|
| 75 |
|
| 76 |
---
|
| 77 |
|
|
@@ -85,25 +89,25 @@ Two fundamentally different model architectures serve different roles in this sy
|
|
| 85 |
|---|---|---|
|
| 86 |
| **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
|
| 87 |
| **Speed** | Fast: embeddings pre-computed at index build time | Slow: must re-encode every (query, doc) pair at inference |
|
| 88 |
-
| **Quality** | Good for retrieval: finds semantically similar docs | Better for
|
| 89 |
-
| **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (
|
| 90 |
|
| 91 |
**Measured overhead (CPU, HF Spaces):**
|
| 92 |
|
| 93 |
| Step | Model | Typical latency |
|
| 94 |
|------|-------|----------------|
|
| 95 |
| Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10-15 ms |
|
| 96 |
-
| KB cosine search
|
| 97 |
| Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
|
| 98 |
-
| Faithfulness (
|
| 99 |
-
| Total grading overhead | β | ~
|
| 100 |
|
| 101 |
**Why bi-encoder for retrieval:** query time is constant regardless of KB size because
|
| 102 |
document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
|
| 103 |
query latency; only index build time grows.
|
| 104 |
|
| 105 |
**Why cross-encoder for faithfulness:** cross-encoders see both the document and the
|
| 106 |
-
|
| 107 |
can be semantically similar to a document (high cosine) while still hallucinating specific
|
| 108 |
facts. The cross-encoder catches this; the bi-encoder does not.
|
| 109 |
|
|
@@ -124,22 +128,23 @@ It flags rival-client terms appearing without the correct client term.
|
|
| 124 |
**Why this matters:** in production multi-tenant AI systems, terminology leakage
|
| 125 |
between clients is a real failure mode. This catches it mechanically.
|
| 126 |
|
| 127 |
-
### Faithfulness
|
| 128 |
|
| 129 |
-
The faithfulness grader uses
|
| 130 |
-
a cross-encoder fine-tuned specifically for RAG faithfulness (not general NLI entailment).
|
| 131 |
-
It scores `(document_chunk, response)` pairs and returns a probability in [0, 1] that
|
| 132 |
-
the response is factually consistent with the document.
|
| 133 |
|
| 134 |
-
**
|
| 135 |
-
|
| 136 |
-
is
|
| 137 |
|
| 138 |
-
**
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 143 |
|
| 144 |
### In-memory semantic retrieval
|
| 145 |
|
|
@@ -193,8 +198,12 @@ knowledge/
|
|
| 193 |
features.yaml KB documents for retrieval
|
| 194 |
|
| 195 |
eval/
|
| 196 |
-
golden-dataset.yaml
|
| 197 |
-
metrics.py
|
|
|
|
|
|
|
|
|
|
|
|
|
| 198 |
|
| 199 |
ui/
|
| 200 |
index.html Chat interface + eval panel
|
|
@@ -207,7 +216,10 @@ ui/
|
|
| 207 |
|
| 208 |
| Decision | Alternative | Why this |
|
| 209 |
|----------|-------------|----------|
|
| 210 |
-
| Vectara HHEM v2
|
|
|
|
|
|
|
|
|
|
| 211 |
| In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
|
| 212 |
| Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
|
| 213 |
| Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
|
|
@@ -219,10 +231,35 @@ ui/
|
|
| 219 |
|
| 220 |
## Evaluation coverage vs RAGAS
|
| 221 |
|
| 222 |
-
| RAGAS metric | Coverage |
|
| 223 |
-
|---|---|
|
| 224 |
-
| faithfulness | β L1 (
|
| 225 |
-
| answer_relevancy | β L1 (cosine) + L2 (keyphrase) |
|
| 226 |
-
| context_precision | partial β retrieval score
|
| 227 |
-
| context_recall | β L2 (keyphrase coverage) |
|
| 228 |
-
| answer_correctness | β L2 (keyphrase + expected_answer) |
|
| 60 |
| `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate: fails hard |
|
| 61 |
| `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
|
| 62 |
| `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
|
| 63 |
+
| `faithfulness` | Claim decomposition + sentence-level NLI | ≥ 0.35 (proportion) | Hallucination detection, claim-level granularity |
|
| 64 |
| `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
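
The two cheapest L1 checks in the table above can be sketched in a few lines. This is a hedged illustration rather than the grader's actual code: it reuses only the SSN and 16-digit card patterns visible in `backend/grader.py` in this commit (the email and phone patterns are not shown, so they are omitted), and the names `find_pii`, `estimate_tokens`, and `within_budget`, as well as the use of integer division, are assumptions.

```python
import re

# Only the two patterns visible in this commit; the real list also covers email and phone.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "SSN"),
    (re.compile(r"\b\d{16}\b"), "credit card"),
]


def find_pii(text: str) -> list[str]:
    """Return the label of every PII pattern that matches somewhere in the text."""
    return [label for pattern, label in PII_PATTERNS if pattern.search(text)]


def estimate_tokens(text: str) -> int:
    """Rough token estimate: character count divided by 4 (integer division assumed)."""
    return len(text) // 4


def within_budget(text: str, budget: int = 512) -> bool:
    """Pass when the estimated token count does not exceed the budget."""
    return estimate_tokens(text) <= budget
```

Both checks are deterministic and need no model, which is why they can gate every request without adding measurable latency.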
|
| 65 |
|
| 66 |
### L2 - Batch (local, against golden dataset)
|
|
|
|
| 68 |
```bash
|
| 69 |
python eval/metrics.py --domain retail
|
| 70 |
python eval/metrics.py --client novamart --out results.json
|
| 71 |
+
python eval/calibrate.py # threshold distribution on golden answers
|
| 72 |
+
python eval/compare_faithfulness.py # whole-response vs claim-level side-by-side
|
| 73 |
+
python eval/drift.py # KS drift detection vs golden baseline
|
| 74 |
```
|
| 75 |
|
| 76 |
+
Runs golden pairs through the full pipeline. Adds keyphrase coverage scoring on top of
|
| 77 |
+
L1 metrics to verify factual completeness against reference answers. `drift.py` compares
|
| 78 |
+
live telemetry score distributions against the golden baseline using KS two-sample tests.
|
| 79 |
|
| 80 |
---
|
| 81 |
|
|
|
|
| 89 |
|---|---|---|
|
| 90 |
| **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
|
| 91 |
| **Speed** | Fast: embeddings pre-computed at index build time | Slow: must re-encode every (query, doc) pair at inference |
|
| 92 |
+
| **Quality** | Good for retrieval: finds semantically similar docs | Better for NLI: captures fine-grained entailment between short sequences |
|
| 93 |
+
| **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (`nli-deberta-v3-small`) |
|
| 94 |
|
| 95 |
**Measured overhead (CPU, HF Spaces):**
|
| 96 |
|
| 97 |
| Step | Model | Typical latency |
|
| 98 |
|------|-------|----------------|
|
| 99 |
| Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10-15 ms |
|
| 100 |
+
| KB cosine search | numpy matrix multiply | ~2 ms |
|
| 101 |
| Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
|
| 102 |
+
| Faithfulness (N claims × M sentence pairs) | cross-encoder NLI | ~200-500 ms |
|
| 103 |
+
| Total grading overhead | - | ~250-550 ms |
|
| 104 |
|
| 105 |
**Why bi-encoder for retrieval:** query time is constant regardless of KB size because
|
| 106 |
document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
|
| 107 |
query latency; only index build time grows.
|
| 108 |
|
| 109 |
**Why cross-encoder for faithfulness:** cross-encoders see both the document and the
|
| 110 |
+
claim simultaneously, capturing entailment relationships bi-encoders miss. A response
|
| 111 |
can be semantically similar to a document (high cosine) while still hallucinating specific
|
| 112 |
facts. The cross-encoder catches this; the bi-encoder does not.
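
A toy example makes the gap concrete. The sketch below uses a bag-of-words cosine as a crude, assumed stand-in for the bi-encoder embedding (the real system uses `all-MiniLM-L6-v2`); `bow_cosine` and the two sentences are hypothetical. It shows only that a similarity score measures overlap, so a response that changes one fact still looks close to its source.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (toy stand-in for embeddings)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(count * vb[token] for token, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


document = "Orders ship within 2 business days."
response = "Orders ship within 9 business days."  # one fabricated number

# Five of six tokens match, so similarity stays high even though the fact is wrong.
similarity = bow_cosine(document, response)
```

An entailment model scoring the same pair jointly has the chance to notice that "2" and "9" conflict, which is the property the faithfulness grader relies on.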
|
| 113 |
|
|
|
|
| 128 |
**Why this matters:** in production multi-tenant AI systems, terminology leakage
|
| 129 |
between clients is a real failure mode. This catches it mechanically.
|
| 130 |
|
| 131 |
+
### Faithfulness: claim decomposition + sentence-level NLI
|
| 132 |
|
| 133 |
+
The faithfulness grader (`grade_faithfulness_decomposed`) uses a three-step pipeline:
|
|
|
|
|
|
|
|
|
|
| 134 |
|
| 135 |
+
1. **Claim decomposition**: the response is split into individual sentences via regex. Each sentence is treated as an atomic claim and verified independently.
|
| 136 |
+
2. **Context sentence splitting**: KB chunks are split into individual sentences before scoring. NLI cross-encoders are calibrated on sentence-pair inputs (the SNLI/MNLI training format); paragraph-level premises degrade performance significantly (verbatim text scored about 0.002 entailment when the context was a 3-4 sentence paragraph).
|
| 137 |
+
3. **Per-claim NLI scoring**: each claim is scored against every context sentence. A claim is "supported" if its maximum entailment is ≥ the threshold. Score = supported_claims / total_claims.
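
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `backend/grader.py`: `toy_entailment` is a hypothetical token-overlap stand-in for the cross-encoder, the 0.9 per-claim threshold is an assumed value chosen for that toy scorer, and `claim_level_score` is an illustrative name.

```python
import re
from typing import Callable

# Same splitting rule as the commit's _SENTENCE_SPLIT: break after ., ! or ? plus whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str, min_words: int = 3) -> list[str]:
    """Split text into sentences, dropping fragments shorter than min_words."""
    parts = SENTENCE_SPLIT.split(text.strip())
    return [p.strip() for p in parts if len(p.split()) >= min_words]


def toy_entailment(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for the NLI model: share of hypothesis tokens found in the premise."""
    premise_tokens = set(re.findall(r"\w+", premise.lower()))
    hypothesis_tokens = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hypothesis_tokens) / len(hypothesis_tokens)


def claim_level_score(
    response: str,
    chunks: list[str],
    entail: Callable[[str, str], float] = toy_entailment,
    threshold: float = 0.9,  # assumed per-claim cutoff for the toy scorer
) -> tuple[float, list[str]]:
    """Return (supported_claims / total_claims, list of unsupported claims)."""
    claims = split_sentences(response)
    context = [s for chunk in chunks for s in split_sentences(chunk)]
    if not claims or not context:
        return 0.0, claims
    # A claim is supported when its best-matching context sentence clears the threshold.
    unsupported = [
        claim for claim in claims
        if max(entail(sentence, claim) for sentence in context) < threshold
    ]
    return (len(claims) - len(unsupported)) / len(claims), unsupported
```

Swapping `toy_entailment` for a function that returns the cross-encoder's entailment probability would leave the scoring logic unchanged; the list of unsupported claims is what makes a partial hallucination visible.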
|
| 138 |
|
| 139 |
+
**Model:** `cross-encoder/nli-deberta-v3-small`, a 3-class NLI model (contradiction / entailment / neutral). The entailment column is used. Sentence-level inputs give `ent ≥ 0.98` for verbatim and aliased claims ("item registry" vs "product catalog (item registry)").
|
| 140 |
+
|
| 141 |
+
**Why claim-level not whole-response NLI:** whole-response NLI misses partial hallucinations. A 4-sentence response with 3 correct sentences and 1 fabricated one scores high because the model finds one well-grounded sentence. Claim-level scores 3/4 = 0.75 and exposes the fabrication in metadata.
|
| 142 |
+
|
| 143 |
+
**Why sentence-level context not paragraph-level:** NLI cross-encoders are trained on single (premise, hypothesis) sentence pairs. Feeding a paragraph as the premise causes entailment scores to collapse, because the model distributes probability mass across longer sequences in ways not seen during training. Sentence-level splitting resolves alias mismatches too: `"pricing sync"` vs `"Price updates (pricing syncs) must be submitted..."` scores `ent=0.986` at sentence level.
|
| 144 |
+
|
| 145 |
+
**Why not Claude-as-judge for L1:** adds API cost and latency per query; non-deterministic. The cross-encoder handles L1; LLM-as-judge belongs in L2 batch evaluation for authoritative ground-truth comparison.
|
| 146 |
+
|
| 147 |
+
**Why not Vectara HHEM v2:** the HHEM v2 checkpoint is missing `t5.transformer.encoder.embed_tokens.weight`, so the embedding matrix is zero-initialized and the model produces a constant 0.502 probability for every input regardless of content. Diagnosed via `embed_tokens.std() == 0.0`.
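
A sanity check along these lines would surface this kind of silent failure early. It is a sketch that assumes the weights and scores are available as plain Python floats (with a real checkpoint the same test is the `embed_tokens.std() == 0.0` inspection quoted above); `is_constant` is a hypothetical helper.

```python
from statistics import pstdev


def is_constant(values: list[float], tol: float = 1e-9) -> bool:
    """True when the values show no spread, e.g. an all-zero weight matrix."""
    return len(values) < 2 or pstdev(values) <= tol


# A zero-initialized embedding matrix, flattened to a single list of weights.
embedding_rows = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
flat_weights = [w for row in embedding_rows for w in row]

weights_degenerate = is_constant(flat_weights)         # no information in the weights
scorer_degenerate = is_constant([0.502, 0.502, 0.502]) # same score for every input
scorer_healthy = not is_constant([0.995, 0.002, 0.4])  # scores vary with the input
```

Running the scorer on a handful of obviously entailed and obviously contradicted pairs and checking that the outputs differ is a cheap guard, since a missing weight produced no error message here.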
|
| 148 |
|
| 149 |
### In-memory semantic retrieval
|
| 150 |
|
|
|
|
| 198 |
features.yaml KB documents for retrieval
|
| 199 |
|
| 200 |
eval/
|
| 201 |
+
golden-dataset.yaml 24 Q&A pairs (20 standard + 4 adversarial edge cases)
|
| 202 |
+
metrics.py L2 batch runner: CLI, keyphrase scoring, HTML report
|
| 203 |
+
calibrate.py Threshold calibration: score distributions on golden answers
|
| 204 |
+
compare_faithfulness.py Side-by-side: whole-response vs claim-level faithfulness scores
|
| 205 |
+
drift.py KS drift detection: live telemetry vs golden baseline
|
| 206 |
+
simulate_traffic.py Populate telemetry with clean + hallucinated traffic for drift testing
|
| 207 |
|
| 208 |
ui/
|
| 209 |
index.html Chat interface + eval panel
|
|
|
|
| 216 |
|
| 217 |
| Decision | Alternative | Why this |
|
| 218 |
|----------|-------------|----------|
|
| 219 |
+
| `nli-deberta-v3-small` + sentence splitting | Vectara HHEM v2 / Claude-as-judge | HHEM broken (zero embeddings); DeBERTa works at sentence level; no API cost |
|
| 220 |
+
| Claim-level faithfulness (proportion) | Whole-response NLI | Whole-response misses partial hallucinations; claim-level exposes them in metadata |
|
| 221 |
+
| Sentence-level context splitting | Full paragraph as NLI premise | NLI models calibrated on sentence pairs; paragraph inputs collapse entailment scores |
|
| 222 |
+
| KS two-sample test for drift | Evidently DataDriftPreset | Same statistical test, no extra dependency (scipy via scikit-learn) |
|
| 223 |
| In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
|
| 224 |
| Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
|
| 225 |
| Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
|
|
|
|
| 231 |
|
| 232 |
## Evaluation coverage vs RAGAS
|
| 233 |
|
| 234 |
+
| RAGAS metric | Coverage | Notes |
|
| 235 |
+
|---|---|---|
|
| 236 |
+
| faithfulness | ✓ L1 (claim-level NLI) | `grade_faithfulness_decomposed()`, sentence-level cross-encoder |
|
| 237 |
+
| answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) | Bi-encoder cosine; LLM-based in L2 |
|
| 238 |
+
| context_precision | partial: retrieval score in UI | No rank-weighted precision@k |
|
| 239 |
+
| context_recall | ✓ L2 (keyphrase coverage) | Keyphrases as proxy for claim coverage |
|
| 240 |
+
| answer_correctness | ✓ L2 (keyphrase + expected_answer) | |
|
| 241 |
+
|
| 242 |
+
## Drift detection
|
| 243 |
+
|
| 244 |
+
`eval/drift.py` detects distribution shift in grader scores between live traffic and the golden-dataset baseline.
|
| 245 |
+
|
| 246 |
+
```
|
| 247 |
+
reference = build_reference() # run all graders on golden-dataset expected_answers
|
| 248 |
+
current = build_current() # pull metric scores from telemetry._events
|
| 249 |
+
|
| 250 |
+
results = detect_drift(current, reference, alpha=0.05)
|
| 251 |
+
# -> per-metric: ks_statistic, p_value, drifted, ref_mean, cur_mean, delta
|
| 252 |
+
```
|
| 253 |
+
|
| 254 |
+
**Statistical test:** KS two-sample (Kolmogorov-Smirnov). Same test as Evidently `DataDriftPreset` for numerical columns. Detects any shift in distribution shape, not just mean change.
|
| 255 |
+
|
| 256 |
+
**Sensitivity:** with n_ref=24 golden pairs, KS test reaches p < 0.05 at ~40% traffic degradation (n_cur=40+). Smaller effects require larger current sample windows.
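
For readers who want to see what the test computes, here is a dependency-free sketch of a two-sample KS check for a single metric. It is not the code in `eval/drift.py` (which the commit says relies on scipy); `ks_statistic`, `ks_p_value`, and `drift_report` are illustrative names, and the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction, so it is an approximation for samples as small as 24.

```python
import math
from bisect import bisect_right
from statistics import mean


def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Largest vertical gap between the two empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:  # the maximum gap is always attained at an observed value
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def ks_p_value(d: float, n_ref: int, n_cur: int) -> float:
    """Approximate two-sided p-value from the asymptotic Kolmogorov distribution."""
    effective_n = math.sqrt(n_ref * n_cur / (n_ref + n_cur))
    lam = (effective_n + 0.12 + 0.11 / effective_n) * d
    if lam < 0.2:  # the series converges poorly here and the true p-value is ~1
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * series))


def drift_report(reference: list[float], current: list[float], alpha: float = 0.05) -> dict:
    """Per-metric summary using the field names listed in the snippet above."""
    d = ks_statistic(reference, current)
    p = ks_p_value(d, len(reference), len(current))
    ref_mean, cur_mean = mean(reference), mean(current)
    return {
        "ks_statistic": d,
        "p_value": p,
        "drifted": p < alpha,
        "ref_mean": ref_mean,
        "cur_mean": cur_mean,
        "delta": cur_mean - ref_mean,
    }
```

In practice `scipy.stats.ks_2samp` is the better choice because it computes exact p-values for small samples; the sketch is only meant to show that the statistic compares whole distributions rather than means.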
|
| 257 |
+
|
| 258 |
+
**What each metric's drift signals:**
|
| 259 |
+
|
| 260 |
+
| Metric | Drift means |
|
| 261 |
+
|--------|-------------|
|
| 262 |
+
| `faithfulness` | Model hallucinating more / KB stale / retrieval returning wrong docs |
|
| 263 |
+
| `answer_relevancy` | Query distribution shifted / model off-topic |
|
| 264 |
+
| `chain_terminology` | Terminology catalog misaligned with model outputs |
|
| 265 |
+
| `pii_leakage` / `token_budget` | Structural output format changed |
|
|
@@ -43,6 +43,34 @@ teardown fixtures.
|
|
| 43 |
|
| 44 |
---
|
| 45 |
|
| 46 |
## Alternative judge approaches considered
|
| 47 |
|
| 48 |
### Ollama (local LLM judge)
|
|
@@ -59,12 +87,12 @@ dependency on HF token entirely and allow offline eval runs.
|
|
| 59 |
in a structured format designed for rubric-based grading. It's a drop-in replacement
|
| 60 |
for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
|
| 61 |
for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
|
| 62 |
-
The tradeoff vs. the current
|
| 63 |
purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
|
| 64 |
which is more interpretable for audit and debugging.
|
| 65 |
|
| 66 |
-
**Why not used here:**
|
| 67 |
-
Prometheus would be the right choice if rationale logging is a compliance requirement.
|
| 68 |
|
| 69 |
---
|
| 70 |
|
|
@@ -88,14 +116,18 @@ Prometheus would be the right choice if rationale logging is a compliance requir
|
|
| 88 |
|
| 89 |
- **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
|
| 90 |
enough information" responses and returns score=1.0: no factual claims, trivially faithful.
|
| 91 |
-
- **Partial grounding blind spot**: faithfulness now uses
|
| 92 |
-
|
| 93 |
-
|
|
|
|
| 94 |
- **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING
|
| 95 |
log entry and sets `flagged: true` in the response payload. UI shows a red banner.
|
| 96 |
- **Cold-start latency**: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
|
| 97 |
- **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query, rival-term
|
| 98 |
prompt injection, multi-doc synthesis, hallucination bait).
|
|
|
|
|
|
|
|
|
|
| 99 |
|
| 100 |
---
|
| 101 |
|
|
|
|
| 43 |
|
| 44 |
---
|
| 45 |
|
| 46 |
+
## NLI model selection: what was tried and why
|
| 47 |
+
|
| 48 |
+
The faithfulness grader went through three configurations (two models) before converging:
|
| 49 |
+
|
| 50 |
+
**Vectara HHEM v2** (`vectara/hallucination_evaluation_model`): purpose-built for RAG
|
| 51 |
+
faithfulness, not general NLI. The correct model for this task. Unusable: the checkpoint
|
| 52 |
+
is missing `t5.transformer.encoder.embed_tokens.weight`. The embedding matrix is
|
| 53 |
+
zero-initialized (`std=0.0`), producing constant 0.502 probability for every input.
|
| 54 |
+
Diagnosed via weight inspection, not error message.
|
| 55 |
+
|
| 56 |
+
**`cross-encoder/nli-deberta-v3-small`** (first attempt, paragraph-level): 3-class NLI
|
| 57 |
+
(contradiction / entailment / neutral). Correct model family, wrong input format.
|
| 58 |
+
NLI cross-encoders are trained on sentence-pair inputs (SNLI/MNLI). Feeding a 3-4
|
| 59 |
+
sentence KB paragraph as the premise causes entailment scores to collapse: verbatim
|
| 60 |
+
text scores `ent=0.002`, treated as neutral. Root cause: model distributes probability
|
| 61 |
+
across longer sequences in ways not seen during training.
|
| 62 |
+
|
| 63 |
+
**`cross-encoder/nli-deberta-v3-small` (sentence-level)**: same model, fixed by splitting
|
| 64 |
+
KB chunks into individual sentences before scoring. Verbatim: `ent=0.995`. Aliased terms
|
| 65 |
+
("item registry" vs "product catalog (item registry)"): `ent=0.989`. Hallucinated facts:
|
| 66 |
+
`ent≈0.000`, contradiction≈1.0. This is the current implementation.
|
| 67 |
+
|
| 68 |
+
**Key insight:** the NLI model selection problem is a data format problem as much as a
|
| 69 |
+
model selection problem. The same model produces correct results at sentence level and
|
| 70 |
+
degenerate results at paragraph level.
|
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
## Alternative judge approaches considered
|
| 75 |
|
| 76 |
### Ollama (local LLM judge)
|
|
|
|
| 87 |
in a structured format designed for rubric-based grading. It's a drop-in replacement
|
| 88 |
for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
|
| 89 |
for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
|
| 90 |
+
The tradeoff vs. the current sentence-level NLI approach: Prometheus is slower (7B vs
|
| 91 |
purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
|
| 92 |
which is more interpretable for audit and debugging.
|
| 93 |
|
| 94 |
+
**Why not used here:** the cross-encoder NLI approach runs faster and requires no prompt
|
| 95 |
+
engineering. Prometheus would be the right choice if rationale logging is a compliance requirement.
|
| 96 |
|
| 97 |
---
|
| 98 |
|
|
|
|
| 116 |
|
| 117 |
- **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
|
| 118 |
enough information" responses and returns score=1.0: no factual claims, trivially faithful.
|
| 119 |
+
- **Partial grounding blind spot**: faithfulness now uses claim-level decomposition
|
| 120 |
+
(`grade_faithfulness_decomposed`). Response split into sentences; each verified
|
| 121 |
+
independently. Score = supported_claims / total_claims. A response with one hallucinated
|
| 122 |
+
sentence in three now scores 0.667, not 1.0.
|
| 123 |
- **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING
|
| 124 |
log entry and sets `flagged: true` in the response payload. UI shows a red banner.
|
| 125 |
- **Cold-start latency**: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
|
| 126 |
- **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query, rival-term
|
| 127 |
prompt injection, multi-doc synthesis, hallucination bait).
|
| 128 |
+
- **No drift detection**: added `eval/drift.py`, a KS two-sample test per metric that compares
|
| 129 |
+
live telemetry scores against golden-dataset baseline. Detects faithfulness degradation
|
| 130 |
+
at p < 0.05 with ~40% traffic degradation across 40+ events.
|
| 131 |
|
| 132 |
---
|
| 133 |
|
|
@@ -57,13 +57,23 @@ All tests are stateless; no cleanup required.
|
|
| 57 |
## Batch evaluation (L2)
|
| 58 |
|
| 59 |
```bash
|
| 60 |
-
make eval-retail # evaluate
|
| 61 |
-
make eval-pharma # evaluate
|
| 62 |
-
make eval # all
|
| 63 |
```
|
| 64 |
|
| 65 |
Reports are written to `eval/reports/`.
|
| 66 |
|
| 67 |
---
|
| 68 |
|
| 69 |
## Code quality
|
|
@@ -133,8 +143,9 @@ See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency
|
|
| 133 |
| PII Leakage | L1 live | Regex scan, binary |
|
| 134 |
| Token Budget | L1 live | Char count ÷ 4 |
|
| 135 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
|
| 136 |
-
| Faithfulness | L1 live |
|
| 137 |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
|
| 138 |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
|
|
|
|
| 139 |
|
| 140 |
**Core principle:** no single metric proves correctness. The combination does.
|
|
|
|
| 57 |
## Batch evaluation (L2)
|
| 58 |
|
| 59 |
```bash
|
| 60 |
+
make eval-retail # evaluate retail Q&A pairs, open HTML report
|
| 61 |
+
make eval-pharma # evaluate pharma Q&A pairs, open HTML report
|
| 62 |
+
make eval # all domains
|
| 63 |
```
|
| 64 |
|
| 65 |
Reports are written to `eval/reports/`.
|
| 66 |
|
| 67 |
+
**Drift detection** (no server required):
|
| 68 |
+
|
| 69 |
+
```bash
|
| 70 |
+
python eval/simulate_traffic.py # populate telemetry + run drift report
|
| 71 |
+
python eval/drift.py # drift report against live telemetry
|
| 72 |
+
```
|
| 73 |
+
|
| 74 |
+
Compares live grader score distributions against the golden-dataset baseline using KS tests.
|
| 75 |
+
Detects faithfulness degradation from model updates, KB staleness, or query distribution shift.
|
| 76 |
+
|
| 77 |
---
|
| 78 |
|
| 79 |
## Code quality
|
|
|
|
| 143 |
| PII Leakage | L1 live | Regex scan, binary |
|
| 144 |
| Token Budget | L1 live | Char count ÷ 4 |
|
| 145 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
|
| 146 |
+
| Faithfulness | L1 live | Claim decomposition + sentence-level NLI cross-encoder |
|
| 147 |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
|
| 148 |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
|
| 149 |
+
| Drift Detection | L2 offline | KS two-sample test vs golden-dataset baseline |
|
| 150 |
|
| 151 |
**Core principle:** no single metric proves correctness. The combination does.
|
|
@@ -5,7 +5,7 @@ Metrics:
|
|
| 5 |
pii_leakage - regex scan for PII patterns in response
|
| 6 |
token_budget - response within allowed token ceiling
|
| 7 |
answer_relevancy - cosine similarity between query and response embeddings
|
| 8 |
-
faithfulness β
|
| 9 |
chain_terminology - deterministic: client-specific terms used (via RosettaStone)
|
| 10 |
"""
|
| 11 |
|
|
@@ -14,19 +14,20 @@ import re
|
|
| 14 |
from dataclasses import dataclass, field
|
| 15 |
from typing import Any
|
| 16 |
|
|
|
|
| 17 |
from config import EMBEDDER_MODEL
|
| 18 |
from rosetta import check_terminology
|
| 19 |
-
from sentence_transformers import SentenceTransformer
|
| 20 |
from sklearn.metrics.pairwise import cosine_similarity
|
| 21 |
-
from transformers import T5Tokenizer
|
| 22 |
-
from transformers import pipeline as hf_pipeline
|
| 23 |
|
| 24 |
log = logging.getLogger(__name__)
|
| 25 |
|
| 26 |
_embedder: SentenceTransformer | None = None
|
| 27 |
-
_nli_model:
|
| 28 |
|
| 29 |
-
|
|
|
|
|
|
|
| 30 |
|
| 31 |
|
| 32 |
def get_embedder() -> SentenceTransformer:
|
|
@@ -37,31 +38,11 @@ def get_embedder() -> SentenceTransformer:
|
|
| 37 |
return _embedder
|
| 38 |
|
| 39 |
|
| 40 |
-
def get_nli_model() ->
|
| 41 |
-
"""Return the shared
|
| 42 |
global _nli_model
|
| 43 |
if _nli_model is None:
|
| 44 |
-
|
| 45 |
-
# set β transformers 5.x requires it in _finalize_model_loading. Patch before load.
|
| 46 |
-
from transformers import PreTrainedModel
|
| 47 |
-
_orig = PreTrainedModel.mark_tied_weights_as_initialized
|
| 48 |
-
def _patched(self: Any, loading_info: Any) -> None:
|
| 49 |
-
if not hasattr(self, "all_tied_weights_keys"):
|
| 50 |
-
self.all_tied_weights_keys = {}
|
| 51 |
-
_orig(self, loading_info) # type: ignore[no-untyped-call]
|
| 52 |
-
PreTrainedModel.mark_tied_weights_as_initialized = _patched # type: ignore[method-assign]
|
| 53 |
-
|
| 54 |
-
tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
| 55 |
-
_nli_model = hf_pipeline(
|
| 56 |
-
"text-classification",
|
| 57 |
-
model=NLI_MODEL,
|
| 58 |
-
tokenizer=tokenizer,
|
| 59 |
-
trust_remote_code=True,
|
| 60 |
-
truncation=True,
|
| 61 |
-
max_length=512,
|
| 62 |
-
)
|
| 63 |
-
|
| 64 |
-
PreTrainedModel.mark_tied_weights_as_initialized = _orig # type: ignore[method-assign]
|
| 65 |
return _nli_model
|
| 66 |
|
| 67 |
|
|
@@ -95,6 +76,8 @@ class GradeReport:
|
|
| 95 |
}
|
| 96 |
|
| 97 |
|
|
|
|
|
|
|
| 98 |
_PII_PATTERNS = [
|
| 99 |
(r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
|
| 100 |
(r"\b\d{16}\b", "credit card"),
|
|
@@ -163,8 +146,28 @@ def _strip_chunk_title(chunk: str) -> str:
|
|
| 163 |
return chunk
|
| 164 |
|
| 165 |
|
| 166 |
def grade_faithfulness(response: str, context: str) -> GradeResult:
|
| 167 |
-
"""
|
| 168 |
if _is_refusal(response):
|
| 169 |
return GradeResult(
|
| 170 |
metric="faithfulness", passed=True, score=1.0,
|
|
@@ -175,24 +178,63 @@ def grade_faithfulness(response: str, context: str) -> GradeResult:
|
|
| 175 |
if not raw_chunks:
|
| 176 |
return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
|
| 177 |
chunks = [_strip_chunk_title(c) for c in raw_chunks]
|
| 178 |
-
|
| 179 |
-
pairs = [
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
scores
|
| 183 |
-
|
| 184 |
-
for r in results
|
| 185 |
-
]
|
| 186 |
-
score = float(max(scores))
|
| 187 |
-
passed = score >= FAITHFULNESS_THRESHOLD
|
| 188 |
return GradeResult(
|
| 189 |
metric="faithfulness",
|
| 190 |
-
passed=
|
| 191 |
score=score,
|
| 192 |
detail=f"Faithfulness {score:.3f} (threshold: {FAITHFULNESS_THRESHOLD})",
|
| 193 |
)
|
| 194 |
|
| 195 |
|
| 196 |
def grade_chain_terminology(response: str, client: str) -> GradeResult:
|
| 197 |
"""Check that the response uses client-specific terms, not rival terminology."""
|
| 198 |
result = check_terminology(response, client)
|
|
@@ -226,7 +268,7 @@ def grade(
|
|
| 226 |
grade_pii_leakage(response),
|
| 227 |
grade_token_budget(response, token_budget),
|
| 228 |
grade_answer_relevancy(query, response),
|
| 229 |
-
|
| 230 |
grade_chain_terminology(response, client),
|
| 231 |
]
|
| 232 |
return report
|
|
|
|
| 5 |
pii_leakage - regex scan for PII patterns in response
|
| 6 |
token_budget - response within allowed token ceiling
|
| 7 |
answer_relevancy - cosine similarity between query and response embeddings
|
| 8 |
+
faithfulness - NLI cross-encoder: entailment score per (context sentence, claim) pair
|
| 9 |
chain_terminology - deterministic: client-specific terms used (via RosettaStone)
|
| 10 |
"""
|
| 11 |
|
|
|
|
| 14 |
from dataclasses import dataclass, field
|
| 15 |
from typing import Any
|
| 16 |
|
| 17 |
+
import numpy as np
|
| 18 |
from config import EMBEDDER_MODEL
|
| 19 |
from rosetta import check_terminology
|
| 20 |
+
from sentence_transformers import CrossEncoder, SentenceTransformer
|
| 21 |
from sklearn.metrics.pairwise import cosine_similarity
|
|
|
|
|
|
|
| 22 |
|
| 23 |
log = logging.getLogger(__name__)
|
| 24 |
|
| 25 |
_embedder: SentenceTransformer | None = None
|
| 26 |
+
_nli_model: CrossEncoder | None = None
|
| 27 |
|
| 28 |
+
# cross-encoder/nli-deberta-v3-small: 3-class NLI, columns = [contradiction, entailment, neutral]
|
| 29 |
+
NLI_MODEL = "cross-encoder/nli-deberta-v3-small"
|
| 30 |
+
_NLI_ENTAILMENT_IDX = 1
|
| 31 |
|
| 32 |
|
| 33 |
def get_embedder() -> SentenceTransformer:
|
|
|
|
| 38 |
return _embedder
|
| 39 |
|
| 40 |
|
| 41 |
+
def get_nli_model() -> CrossEncoder:
|
| 42 |
+
"""Return the shared NLI cross-encoder, loading it on first call."""
|
| 43 |
global _nli_model
|
| 44 |
if _nli_model is None:
|
| 45 |
+
_nli_model = CrossEncoder(NLI_MODEL)
|
| 46 |
return _nli_model
|
| 47 |
|
| 48 |
|
|
|
|
| 76 |
}
|
| 77 |
|
| 78 |
|
| 79 |
+
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
|
| 80 |
+
|
| 81 |
_PII_PATTERNS = [
|
| 82 |
(r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
|
| 83 |
(r"\b\d{16}\b", "credit card"),
|
|
|
|
| 146 |
return chunk
|
| 147 |
|
| 148 |
|
| 149 |
+
def decompose_claims(response: str) -> list[str]:
|
| 150 |
+
"""Split response into atomic claim sentences (≥3 words each)."""
|
| 151 |
+
sentences = _SENTENCE_SPLIT.split(response.strip())
|
| 152 |
+
return [s.strip() for s in sentences if len(s.split()) >= 3]
|
| 153 |
+
|
| 154 |
+
|
| 155 |
+
def _context_sentences(chunks: list[str]) -> list[str]:
|
| 156 |
+
"""Flatten context chunks into individual sentences for sentence-level NLI scoring.
|
| 157 |
+
|
| 158 |
+
Cross-encoder NLI degrades on multi-sentence inputs; performance is calibrated
|
| 159 |
+
on single-sentence (premise, hypothesis) pairs matching the SNLI/MNLI training format.
|
| 160 |
+
"""
|
| 161 |
+
sentences = []
|
| 162 |
+
for chunk in chunks:
|
| 163 |
+
for s in _SENTENCE_SPLIT.split(chunk.strip()):
|
| 164 |
+
if len(s.split()) >= 3:
|
| 165 |
+
sentences.append(s.strip())
|
| 166 |
+
return sentences
|
| 167 |
+
|
| 168 |
+
|
| 169 |
def grade_faithfulness(response: str, context: str) -> GradeResult:
|
| 170 |
+
"""Whole-response faithfulness: max entailment score across all context chunks."""
|
| 171 |
if _is_refusal(response):
|
| 172 |
return GradeResult(
|
| 173 |
metric="faithfulness", passed=True, score=1.0,
|
|
|
|
| 178 |
if not raw_chunks:
|
| 179 |
return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
|
| 180 |
chunks = [_strip_chunk_title(c) for c in raw_chunks]
|
| 181 |
+
sentences = _context_sentences(chunks)
|
| 182 |
+
pairs = [(s, response) for s in sentences]
|
| 183 |
+
scores_matrix: np.ndarray = model.predict(pairs, apply_softmax=True)
|
| 184 |
+
entailment: np.ndarray = scores_matrix[:, _NLI_ENTAILMENT_IDX]
|
| 185 |
+
log.info("NLI entailment scores: %s", [round(float(s), 3) for s in entailment])
|
| 186 |
+
score = float(entailment.max())
|
| 187 |
return GradeResult(
|
| 188 |
metric="faithfulness",
|
| 189 |
+
passed=score >= FAITHFULNESS_THRESHOLD,
|
| 190 |
score=score,
|
| 191 |
detail=f"Faithfulness {score:.3f} (threshold: {FAITHFULNESS_THRESHOLD})",
|
| 192 |
)
|
| 193 |
|
| 194 |
|
| 195 |
+
def grade_faithfulness_decomposed(response: str, context: str) -> GradeResult:
|
| 196 |
+
"""Claim-level faithfulness: each sentence verified independently against context.
|
| 197 |
+
|
| 198 |
+
Score = supported claims / total claims; catches partial hallucinations missed by whole-response NLI.
|
| 199 |
+
"""
|
| 200 |
+
if _is_refusal(response):
|
| 201 |
+
return GradeResult(
|
| 202 |
+
metric="faithfulness", passed=True, score=1.0,
|
| 203 |
+
detail="Refusal - no factual claims to verify",
|
| 204 |
+
)
|
| 205 |
+
raw_chunks = [c.strip() for c in context.split("\n\n") if c.strip()]
|
| 206 |
+
if not raw_chunks:
|
| 207 |
+
return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
|
| 208 |
+
|
| 209 |
+
chunks = [_strip_chunk_title(c) for c in raw_chunks]
|
| 210 |
+
claims = decompose_claims(response)
|
| 211 |
+
if not claims:
|
| 212 |
+
return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No claims extracted")
|
| 213 |
+
|
| 214 |
+
sentences = _context_sentences(chunks)
|
| 215 |
+
model = get_nli_model()
|
| 216 |
+
claim_results: list[dict[str, Any]] = []
|
| 217 |
+
|
| 218 |
+
for claim in claims:
|
| 219 |
+
pairs = [(s, claim) for s in sentences]
|
| 220 |
+
scores_matrix: np.ndarray = model.predict(pairs, apply_softmax=True)
|
| 221 |
+
entailment: np.ndarray = scores_matrix[:, _NLI_ENTAILMENT_IDX]
|
| 222 |
+
best = float(entailment.max())
|
| 223 |
+
claim_results.append({"claim": claim, "score": round(best, 3), "supported": best >= FAITHFULNESS_THRESHOLD})
|
| 224 |
+
|
| 225 |
+
supported = sum(1 for c in claim_results if c["supported"])
|
| 226 |
+
score = supported / len(claim_results)
|
| 227 |
+
log.info("Claim decomposition: %d/%d supported (score=%.3f)", supported, len(claim_results), score)
|
| 228 |
+
|
| 229 |
+
return GradeResult(
|
| 230 |
+
metric="faithfulness",
|
| 231 |
+
passed=score >= FAITHFULNESS_THRESHOLD,
|
| 232 |
+
score=score,
|
| 233 |
+
detail=f"{supported}/{len(claim_results)} claims supported (threshold: {FAITHFULNESS_THRESHOLD})",
|
| 234 |
+
metadata={"claims": claim_results},
|
| 235 |
+
)
|
| 236 |
+
|
| 237 |
+
|
| 238 |
def grade_chain_terminology(response: str, client: str) -> GradeResult:
|
| 239 |
"""Check that the response uses client-specific terms, not rival terminology."""
|
| 240 |
result = check_terminology(response, client)
|
|
|
|
| 268 |
grade_pii_leakage(response),
|
| 269 |
grade_token_budget(response, token_budget),
|
| 270 |
grade_answer_relevancy(query, response),
|
| 271 |
+
grade_faithfulness_decomposed(response, context),
|
| 272 |
grade_chain_terminology(response, client),
|
| 273 |
]
|
| 274 |
return report
|
|
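The claim-level score in `grade_faithfulness_decomposed()` can be illustrated without loading the NLI model. The sketch below is a simplified stand-in, not the project's code: `split_sentences`, `claim_support_ratio`, `toy_entailment` and the 0.5 threshold are hypothetical names and values chosen for illustration; only the sentence-split regex is taken from the diff.

```python
import re
from typing import Callable

# Same sentence-boundary pattern as the diff: split on whitespace after ., ! or ?
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str, min_words: int = 3) -> list[str]:
    """Split text into sentences, dropping fragments shorter than min_words."""
    parts = _SENTENCE_SPLIT.split(text.strip())
    return [p.strip() for p in parts if len(p.split()) >= min_words]


def claim_support_ratio(
    response: str,
    context: str,
    entail_fn: Callable[[str, str], float],  # hypothetical stand-in for the NLI model
    threshold: float = 0.5,  # assumed value, not the project's constant
) -> float:
    """Fraction of response claims whose best premise score reaches the threshold."""
    claims = split_sentences(response)
    premises = split_sentences(context)
    if not claims or not premises:
        return 0.0
    supported = sum(
        1
        for claim in claims
        if max(entail_fn(premise, claim) for premise in premises) >= threshold
    )
    return supported / len(claims)


def toy_entailment(premise: str, hypothesis: str) -> float:
    """Toy scorer: 1.0 on a case-insensitive exact match, else 0.0. Not real NLI."""
    return 1.0 if premise.lower() == hypothesis.lower() else 0.0


context = "Alerts go to the store manager. Reorders need manual approval."
response = "Alerts go to the store manager. Reorders happen automatically after 72 hours."
ratio = claim_support_ratio(response, context, toy_entailment)
print(ratio)  # 0.5: one of the two claims has a supporting context sentence
```

Because each claim is checked on its own, a single unsupported sentence lowers the ratio instead of being masked by well-supported neighbours.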
@@ -0,0 +1,132 @@
|
| 1 |
+
"""
|
| 2 |
+
Side-by-side comparison: whole-response faithfulness vs claim-level decomposition.
|
| 3 |
+
|
| 4 |
+
Each golden-dataset pair is run through both graders using the full domain KB as context
|
| 5 |
+
(simulates retrieval returning all docs, putting maximum pressure on the NLI signal).
|
| 6 |
+
|
| 7 |
+
Output: aligned table with per-pair scores + delta, plus summary distributions.
|
| 8 |
+
|
| 9 |
+
Usage:
|
| 10 |
+
cd /Users/praca/ai-response-validator && .venv/bin/python eval/compare_faithfulness.py
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
import statistics
|
| 14 |
+
import sys
|
| 15 |
+
from pathlib import Path
|
| 16 |
+
|
| 17 |
+
import yaml
|
| 18 |
+
|
| 19 |
+
sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
|
| 20 |
+
|
| 21 |
+
from grader import (
|
| 22 |
+
FAITHFULNESS_THRESHOLD,
|
| 23 |
+
grade_faithfulness,
|
| 24 |
+
grade_faithfulness_decomposed,
|
| 25 |
+
)
|
| 26 |
+
|
| 27 |
+
DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
|
| 28 |
+
KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def _load_pairs() -> list[dict]:
|
| 32 |
+
return yaml.safe_load(DATASET_PATH.read_text())["pairs"]
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def _load_kb_context(domain: str) -> str:
|
| 36 |
+
path = KNOWLEDGE_ROOT / domain / "features.yaml"
|
| 37 |
+
data = yaml.safe_load(path.read_text())
|
| 38 |
+
chunks = [
|
| 39 |
+
f"[{doc['title']}]\n{doc['content'].strip()}"
|
| 40 |
+
for doc in data["documents"]
|
| 41 |
+
]
|
| 42 |
+
return "\n\n".join(chunks)
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def _fmt(score: float | None) -> str:
|
| 46 |
+
return f"{score:.3f}" if score is not None else " - "
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
def run() -> None:
|
| 50 |
+
pairs = _load_pairs()
|
| 51 |
+
kb: dict[str, str] = {}
|
| 52 |
+
|
| 53 |
+
print(f"\nFaithfulness comparison: {len(pairs)} golden-dataset pairs")
|
| 54 |
+
print("Context: full domain KB (all docs, simulating broad retrieval)\n")
|
| 55 |
+
|
| 56 |
+
header = f"{'id':<20} {'whole':>7} {'decomp':>7} {'delta':>7} {'claims':>6} {'sup/tot':>7} note"
|
| 57 |
+
print(header)
|
| 58 |
+
print("-" * len(header))
|
| 59 |
+
|
| 60 |
+
whole_scores: list[float] = []
|
| 61 |
+
decomp_scores: list[float] = []
|
| 62 |
+
deltas: list[float] = []
|
| 63 |
+
refusals: list[str] = []
|
| 64 |
+
|
| 65 |
+
for pair in pairs:
|
| 66 |
+
pid = pair["id"]
|
| 67 |
+
domain = pair["domain"]
|
| 68 |
+
response = pair["expected_answer"].strip()
|
| 69 |
+
|
| 70 |
+
if domain not in kb:
|
| 71 |
+
kb[domain] = _load_kb_context(domain)
|
| 72 |
+
context = kb[domain]
|
| 73 |
+
|
| 74 |
+
w = grade_faithfulness(response, context)
|
| 75 |
+
d = grade_faithfulness_decomposed(response, context)
|
| 76 |
+
|
| 77 |
+
if "Refusal" in w.detail:
|
| 78 |
+
refusals.append(pid)
|
| 79 |
+
print(f"{pid:<20} {'REFUSAL':>7} {'REFUSAL':>7} {'':>7} {'':>6} {'':>7}")
|
| 80 |
+
continue
|
| 81 |
+
|
| 82 |
+
whole_scores.append(w.score)
|
| 83 |
+
decomp_scores.append(d.score)
|
| 84 |
+
delta = d.score - w.score
|
| 85 |
+
deltas.append(delta)
|
| 86 |
+
|
| 87 |
+
claims_meta = d.metadata.get("claims", [])
|
| 88 |
+
n_claims = len(claims_meta)
|
| 89 |
+
n_supported = sum(1 for c in claims_meta if c["supported"])
|
| 90 |
+
sup_tot = f"{n_supported}/{n_claims}"
|
| 91 |
+
|
| 92 |
+
note = ""
|
| 93 |
+
if abs(delta) >= 0.15:
|
| 94 |
+
note = "<-- gap"
|
| 95 |
+
|
| 96 |
+
sign = "+" if delta >= 0 else ""
|
| 97 |
+
print(
|
| 98 |
+
f"{pid:<20} {w.score:>7.3f} {d.score:>7.3f} {delta:>+7.3f} {n_claims:>6} {sup_tot:>7} {note}"
|
| 99 |
+
)
|
| 100 |
+
|
| 101 |
+
print("-" * len(header))
|
| 102 |
+
print()
|
| 103 |
+
|
| 104 |
+
if whole_scores:
|
| 105 |
+
print("Score distributions (refusals excluded):\n")
|
| 106 |
+
for name, scores in [("whole_response", whole_scores), ("decomposed", decomp_scores)]:
|
| 107 |
+
below = sum(1 for s in scores if s < FAITHFULNESS_THRESHOLD)
|
| 108 |
+
print(
|
| 109 |
+
f" {name:<16} "
|
| 110 |
+
f"min={min(scores):.3f} "
|
| 111 |
+
f"p25={sorted(scores)[len(scores)//4]:.3f} "
|
| 112 |
+
f"median={statistics.median(scores):.3f} "
|
| 113 |
+
f"p75={sorted(scores)[3*len(scores)//4]:.3f} "
|
| 114 |
+
f"max={max(scores):.3f} "
|
| 115 |
+
f"below_threshold={below}/{len(scores)}"
|
| 116 |
+
)
|
| 117 |
+
|
| 118 |
+
print()
|
| 119 |
+
neg_delta = sum(1 for d in deltas if d < -0.05)
|
| 120 |
+
mean_abs = statistics.mean(abs(d) for d in deltas)
|
| 121 |
+
print(f" mean |delta| : {mean_abs:.3f}")
|
| 122 |
+
print(f" decomp < whole : {neg_delta}/{len(deltas)} pairs (gap > 0.05; whole-response was optimistic here)")
|
| 123 |
+
print(f" threshold : {FAITHFULNESS_THRESHOLD}")
|
| 124 |
+
|
| 125 |
+
if refusals:
|
| 126 |
+
print(f"\n Refusals (auto-pass, excluded from stats): {', '.join(refusals)}")
|
| 127 |
+
|
| 128 |
+
print()
|
| 129 |
+
|
| 130 |
+
|
| 131 |
+
if __name__ == "__main__":
|
| 132 |
+
run()
|
|
@@ -0,0 +1,200 @@
|
| 1 |
+
"""
|
| 2 |
+
Drift detection: compare live grader score distributions against the golden-dataset baseline.
|
| 3 |
+
|
| 4 |
+
Answers: has answer quality shifted since the reference was established?
|
| 5 |
+
Catches: model updates, KB staleness, query distribution shift, threshold miscalibration.
|
| 6 |
+
|
| 7 |
+
Statistical test: KS two-sample (the test Evidently's DataDriftPreset applies by default to numerical columns in small datasets).
|
| 8 |
+
- H0: current and reference are drawn from the same distribution
|
| 9 |
+
- H1: distributions differ
|
| 10 |
+
- Drifted if p_value < alpha (default 0.05)
|
| 11 |
+
|
| 12 |
+
Reference: golden-dataset expected_answer scores (known-good baseline).
|
| 13 |
+
Current: in-memory telemetry._events from the running API session.
|
| 14 |
+
|
| 15 |
+
Usage:
|
| 16 |
+
cd /Users/praca/ai-response-validator && .venv/bin/python eval/drift.py
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
import sys
|
| 20 |
+
from dataclasses import dataclass
|
| 21 |
+
from pathlib import Path
|
| 22 |
+
|
| 23 |
+
import yaml
|
| 24 |
+
from scipy.stats import ks_2samp
|
| 25 |
+
|
| 26 |
+
sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
|
| 27 |
+
|
| 28 |
+
from grader import (
|
| 29 |
+
grade_answer_relevancy,
|
| 30 |
+
grade_chain_terminology,
|
| 31 |
+
grade_faithfulness_decomposed,
|
| 32 |
+
grade_pii_leakage,
|
| 33 |
+
grade_token_budget,
|
| 34 |
+
)
|
| 35 |
+
|
| 36 |
+
DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
|
| 37 |
+
KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
|
| 38 |
+
|
| 39 |
+
METRICS = ["faithfulness", "answer_relevancy", "pii_leakage", "token_budget", "chain_terminology"]
|
| 40 |
+
ALPHA = 0.05
|
| 41 |
+
MIN_CURRENT_SAMPLES = 5
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
@dataclass(slots=True)
|
| 45 |
+
class MetricDrift:
|
| 46 |
+
metric: str
|
| 47 |
+
ks_statistic: float
|
| 48 |
+
p_value: float
|
| 49 |
+
drifted: bool
|
| 50 |
+
ref_mean: float
|
| 51 |
+
cur_mean: float
|
| 52 |
+
ref_n: int
|
| 53 |
+
cur_n: int
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
def _load_kb_context(domain: str) -> str:
|
| 57 |
+
path = KNOWLEDGE_ROOT / domain / "features.yaml"
|
| 58 |
+
data = yaml.safe_load(path.read_text())
|
| 59 |
+
chunks = [f"[{doc['title']}]\n{doc['content'].strip()}" for doc in data["documents"]]
|
| 60 |
+
return "\n\n".join(chunks)
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
Scores = dict[str, list[float]]
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def build_reference() -> Scores:
|
| 67 |
+
"""Score every golden-dataset pair with all graders."""
|
| 68 |
+
pairs = yaml.safe_load(DATASET_PATH.read_text())["pairs"]
|
| 69 |
+
kb: dict[str, str] = {}
|
| 70 |
+
scores: Scores = {m: [] for m in METRICS}
|
| 71 |
+
|
| 72 |
+
for pair in pairs:
|
| 73 |
+
response = pair["expected_answer"].strip()
|
| 74 |
+
domain = pair["domain"]
|
| 75 |
+
if domain not in kb:
|
| 76 |
+
kb[domain] = _load_kb_context(domain)
|
| 77 |
+
|
| 78 |
+
scores["pii_leakage"].append(grade_pii_leakage(response).score)
|
| 79 |
+
scores["token_budget"].append(grade_token_budget(response).score)
|
| 80 |
+
scores["answer_relevancy"].append(grade_answer_relevancy(pair["question"], response).score)
|
| 81 |
+
scores["faithfulness"].append(grade_faithfulness_decomposed(response, kb[domain]).score)
|
| 82 |
+
scores["chain_terminology"].append(grade_chain_terminology(response, pair["client"]).score)
|
| 83 |
+
|
| 84 |
+
return scores
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
def build_current() -> Scores:
|
| 88 |
+
"""Pull metric scores from the in-memory telemetry buffer."""
|
| 89 |
+
import telemetry
|
| 90 |
+
|
| 91 |
+
with telemetry._lock:
|
| 92 |
+
events = list(telemetry._events)
|
| 93 |
+
|
| 94 |
+
scores: Scores = {m: [] for m in METRICS}
|
| 95 |
+
for event in events:
|
| 96 |
+
if "metrics" not in event:
|
| 97 |
+
continue
|
| 98 |
+
if any(event["metrics"].get(m) is None for m in METRICS):
|
| 99 |
+
continue
|
| 100 |
+
for m in METRICS:
|
| 101 |
+
scores[m].append(float(event["metrics"][m]))
|
| 102 |
+
|
| 103 |
+
return scores
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
def detect_drift(
|
| 107 |
+
current: Scores,
|
| 108 |
+
reference: Scores,
|
| 109 |
+
alpha: float = ALPHA,
|
| 110 |
+
) -> list[MetricDrift]:
|
| 111 |
+
"""Run KS two-sample test per metric. Skips metrics with fewer than MIN_CURRENT_SAMPLES current scores or an empty reference."""
|
| 112 |
+
results: list[MetricDrift] = []
|
| 113 |
+
|
| 114 |
+
for metric in METRICS:
|
| 115 |
+
ref_col = reference.get(metric, [])
|
| 116 |
+
cur_col = current.get(metric, [])
|
| 117 |
+
|
| 118 |
+
if len(cur_col) < MIN_CURRENT_SAMPLES or len(ref_col) == 0:
|
| 119 |
+
continue
|
| 120 |
+
|
| 121 |
+
import numpy as np
|
| 122 |
+
ref_arr = np.array(ref_col, dtype=float)
|
| 123 |
+
cur_arr = np.array(cur_col, dtype=float)
|
| 124 |
+
|
| 125 |
+
stat, pval = ks_2samp(ref_arr, cur_arr)
|
| 126 |
+
results.append(MetricDrift(
|
| 127 |
+
metric=metric,
|
| 128 |
+
ks_statistic=round(float(stat), 4),
|
| 129 |
+
p_value=round(float(pval), 4),
|
| 130 |
+
drifted=bool(pval < alpha),
|
| 131 |
+
ref_mean=round(float(ref_arr.mean()), 4),
|
| 132 |
+
cur_mean=round(float(cur_arr.mean()), 4),
|
| 133 |
+
ref_n=len(ref_arr),
|
| 134 |
+
cur_n=len(cur_arr),
|
| 135 |
+
))
|
| 136 |
+
|
| 137 |
+
return results
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
def report_drift(results: list[MetricDrift], alpha: float = ALPHA) -> None:
|
| 141 |
+
header = (
|
| 142 |
+
f"{'metric':<22} {'ks_stat':>7} {'p_value':>7} {'status':>10}"
|
| 143 |
+
f" {'ref_mean':>8} {'cur_mean':>8} {'delta':>7}"
|
| 144 |
+
)
|
| 145 |
+
print(header)
|
| 146 |
+
print("-" * len(header))
|
| 147 |
+
|
| 148 |
+
for r in results:
|
| 149 |
+
status = "DRIFT <--" if r.drifted else "ok"
|
| 150 |
+
delta = r.cur_mean - r.ref_mean
|
| 151 |
+
sign = "+" if delta >= 0 else ""
|
| 152 |
+
print(
|
| 153 |
+
f"{r.metric:<22} {r.ks_statistic:>7.4f} {r.p_value:>7.4f} {status:>10}"
|
| 154 |
+
f" {r.ref_mean:>8.4f} {r.cur_mean:>8.4f} {sign}{delta:>6.4f}"
|
| 155 |
+
)
|
| 156 |
+
|
| 157 |
+
drifted = [r for r in results if r.drifted]
|
| 158 |
+
print(f"\nOverall: {len(drifted)}/{len(results)} metrics drifted (alpha={alpha})")
|
| 159 |
+
|
| 160 |
+
if drifted:
|
| 161 |
+
print("\nDrifted metrics:")
|
| 162 |
+
for r in drifted:
|
| 163 |
+
direction = "degraded" if r.cur_mean < r.ref_mean else "improved"
|
| 164 |
+
print(f" {r.metric}: {direction} ({r.ref_mean:.3f} -> {r.cur_mean:.3f})")
|
| 165 |
+
|
| 166 |
+
|
| 167 |
+
def run() -> None:
|
| 168 |
+
print("\nBuilding reference distribution from golden-dataset.yaml...")
|
| 169 |
+
reference = build_reference()
|
| 170 |
+
ref_n = len(next(iter(reference.values()), []))
|
| 171 |
+
print(f"Reference: {ref_n} pairs\n")
|
| 172 |
+
|
| 173 |
+
current = build_current()
|
| 174 |
+
|
| 175 |
+
cur_n = len(next(iter(current.values()), []))
|
| 176 |
+
if cur_n < MIN_CURRENT_SAMPLES:
|
| 177 |
+
import numpy as np
|
| 178 |
+
print(
|
| 179 |
+
f"Current: {cur_n} telemetry event(s); need ≥{MIN_CURRENT_SAMPLES} to run KS test.\n"
|
| 180 |
+
f"Start the API and run some queries, then re-run drift.py.\n\n"
|
| 181 |
+
f"Reference distribution (golden baseline):\n"
|
| 182 |
+
)
|
| 183 |
+
for m in METRICS:
|
| 184 |
+
vals = np.array(reference[m])
|
| 185 |
+
print(f" {m:<22} mean={vals.mean():.3f} std={vals.std():.3f} min={vals.min():.3f} max={vals.max():.3f}")
|
| 186 |
+
return
|
| 187 |
+
|
| 188 |
+
print(f"Current: {cur_n} telemetry events\n")
|
| 189 |
+
results = detect_drift(current, reference)
|
| 190 |
+
|
| 191 |
+
if not results:
|
| 192 |
+
print("No metrics had enough data for KS test.\n")
|
| 193 |
+
return
|
| 194 |
+
|
| 195 |
+
report_drift(results)
|
| 196 |
+
print()
|
| 197 |
+
|
| 198 |
+
|
| 199 |
+
if __name__ == "__main__":
|
| 200 |
+
run()
|
|
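The decision rule in the module docstring (drift when `p_value < alpha`) rests on the KS statistic D, the largest vertical gap between the two empirical CDFs. The pure-Python sketch below computes only D, to show what `ks_2samp` measures; it does not compute a p-value, and `ks_statistic` is a hypothetical helper, not part of `eval/drift.py`.

```python
from bisect import bisect_right


def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Largest gap between the two empirical CDFs (the two-sample KS D statistic)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(ref_sorted) | set(cur_sorted):
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


separated = ks_statistic([0.9] * 50, [0.1] * 50)            # fully separated samples
identical = ks_statistic([0.2, 0.5, 0.8], [0.2, 0.5, 0.8])  # same sample twice
binary = ks_statistic([1.0] * 20, [1.0] * 16 + [0.0] * 4)   # pass rate 1.0 vs 0.8
print(separated, identical, binary)
```

For a binary metric such as `pii_leakage`, D reduces to the absolute difference in failure rates between the two samples.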
@@ -0,0 +1,190 @@
|
| 1 |
+
"""
|
| 2 |
+
Populate telemetry with simulated traffic, then run drift detection.
|
| 3 |
+
|
| 4 |
+
Two batches:
|
| 5 |
+
clean: golden-dataset expected_answers (should match reference distribution)
|
| 6 |
+
dirty: same questions, hallucinated responses (should show faithfulness drift)
|
| 7 |
+
|
| 8 |
+
Bypasses the API entirely: runs graders + telemetry.record() directly.
|
| 9 |
+
|
| 10 |
+
Usage:
|
| 11 |
+
cd /Users/praca/ai-response-validator && .venv/bin/python eval/simulate_traffic.py
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
import sys
|
| 15 |
+
import time
|
| 16 |
+
from pathlib import Path
|
| 17 |
+
|
| 18 |
+
import yaml
|
| 19 |
+
|
| 20 |
+
sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
|
| 21 |
+
|
| 22 |
+
import telemetry
|
| 23 |
+
from config import CLIENT_DOMAIN
|
| 24 |
+
from grader import GradeReport, grade
|
| 25 |
+
|
| 26 |
+
DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
|
| 27 |
+
KNOWLEDGE_ROOT = Path(__file__).parent.parent / "knowledge"
|
| 28 |
+
|
| 29 |
+
# Hallucinated responses: plausible-sounding but contradicting KB facts
|
| 30 |
+
HALLUCINATED: dict[str, str] = {
|
| 31 |
+
# retail: NovaMart
|
| 32 |
+
"retail-nm-001": (
|
| 33 |
+
"When a product runs out of stock, the system automatically places a reorder after 72 hours "
|
| 34 |
+
"with no alerts sent to any manager. The supplier is notified only at month-end review."
|
| 35 |
+
),
|
| 36 |
+
"retail-nm-002": (
|
| 37 |
+
"To add a new supplier, send an email to the procurement team with the company name. "
|
| 38 |
+
"No tax ID or payment terms are required at this stage. "
|
| 39 |
+
"Purchase orders can be created immediately without waiting for validation."
|
| 40 |
+
),
|
| 41 |
+
"retail-nm-003": (
|
| 42 |
+
"Feature flags are permanent once enabled and cannot be disabled without a code deployment. "
|
| 43 |
+
"There is no expiry date or activation scope. Any employee can enable a flag in production."
|
| 44 |
+
),
|
| 45 |
+
"retail-nm-004": (
|
| 46 |
+
"The authoritative source for product information is the pricing portal. "
|
| 47 |
+
"SKU records are updated manually once per week by the merchandising team. "
|
| 48 |
+
"Archived products can be reactivated instantly by any store manager."
|
| 49 |
+
),
|
| 50 |
+
"retail-nm-005": (
|
| 51 |
+
"Price changes take effect immediately upon submission with no approval required. "
|
| 52 |
+
"There is no sync window; prices update in real time. "
|
| 53 |
+
"Emergency corrections are handled automatically without escalation."
|
| 54 |
+
),
|
| 55 |
+
# retail: ShelfWise
|
| 56 |
+
"retail-sw-001": (
|
| 57 |
+
"An out-of-stock alert fires only after a manual stock check is initiated by a store manager. "
|
| 58 |
+
"The alert is sent exclusively to the regional director. "
|
| 59 |
+
"No escalation occurs if the alert is unacknowledged."
|
| 60 |
+
),
|
| 61 |
+
"retail-sw-002": (
|
| 62 |
+
"Feature toggles are permanent once enabled. "
|
| 63 |
+
"There is no activation scope and no expiry date requirement. "
|
| 64 |
+
"Any user can enable toggles in production without sign-off."
|
| 65 |
+
),
|
| 66 |
+
"retail-sw-004": (
|
| 67 |
+
"Compliance reports are editable for up to 30 days after creation and are stored for 2 years. "
|
| 68 |
+
"Any user can access compliance reports from the standard dashboard. "
|
| 69 |
+
"Reports are generated on demand only."
|
| 70 |
+
),
|
| 71 |
+
"retail-sw-005": (
|
| 72 |
+
"Product catalog updates require manual approval for each SKU and can take up to 48 hours. "
|
| 73 |
+
"Deactivated products are permanently deleted and cannot be recovered."
|
| 74 |
+
),
|
| 75 |
+
# pharma: ClinixOne
|
| 76 |
+
"pharma-cx-001": (
|
| 77 |
+
"Prior authorization is optional and payers respond within 7 business days. "
|
| 78 |
+
"Denied requests cannot be appealed and the prescriber must choose an alternative drug."
|
| 79 |
+
),
|
| 80 |
+
"pharma-cx-003": (
|
| 81 |
+
"Adverse events must be reported to regulators within 30 days for all event types. "
|
| 82 |
+
"A safety signal is raised automatically by the system when 3 or more events occur. "
|
| 83 |
+
"Expected events do not require regulatory reporting."
|
| 84 |
+
),
|
| 85 |
+
"pharma-cx-004": (
|
| 86 |
+
"Clinical trials have two phases: Phase I for safety and Phase II for market approval. "
|
| 87 |
+
"Enrollment eligibility is determined by the treating physician with no formal criteria."
|
| 88 |
+
),
|
| 89 |
+
# pharma: PharmaLink
|
| 90 |
+
"pharma-pl-001": (
|
| 91 |
+
"Formulary pre-approval is automatically granted for all branded drugs. "
|
| 92 |
+
"The payer responds within 30 days and denied requests cannot be appealed."
|
| 93 |
+
),
|
| 94 |
+
"pharma-pl-003": (
|
| 95 |
+
"The formulary has two tiers: generic and branded. "
|
| 96 |
+
"Moving a drug to a higher tier requires a 7-day notice to prescribers. "
|
| 97 |
+
"Tier assignment is reviewed every 5 years."
|
| 98 |
+
),
|
| 99 |
+
"pharma-pl-004": (
|
| 100 |
+
"A prescribing pathway is a marketing document produced by pharmaceutical companies. "
|
| 101 |
+
"Pathways are reviewed every 5 years and payers do not use them in coverage decisions. "
|
| 102 |
+
"Deviation from a pathway requires no documentation."
|
| 103 |
+
),
|
| 104 |
+
"pharma-pl-005": (
|
| 105 |
+
"Enrollment authorization is a formality; patients sign a standard waiver. "
|
| 106 |
+
"Consent is obtained after the first study procedure, not before. "
|
| 107 |
+
"Protocol changes do not require re-consent from existing participants."
|
| 108 |
+
),
|
| 109 |
+
}
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
def _load_kb_context(domain: str) -> str:
|
| 113 |
+
path = KNOWLEDGE_ROOT / domain / "features.yaml"
|
| 114 |
+
data = yaml.safe_load(path.read_text())
|
| 115 |
+
chunks = [f"[{doc['title']}]\n{doc['content'].strip()}" for doc in data["documents"]]
|
| 116 |
+
return "\n\n".join(chunks)
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
def _record(pair: dict, response: str, context: str, tag: str) -> GradeReport:
|
| 120 |
+
client = pair["client"]
|
| 121 |
+
report = grade(
|
| 122 |
+
query=pair["question"],
|
| 123 |
+
response=response,
|
| 124 |
+
context=context,
|
| 125 |
+
client=client,
|
| 126 |
+
)
|
| 127 |
+
telemetry.record(
|
| 128 |
+
client=client,
|
| 129 |
+
domain=pair["domain"],
|
| 130 |
+
query_len=len(pair["question"].split()),
|
| 131 |
+
latency_ms={"retrieve": 12.0, "generate": 180.0, "grade": 45.0},
|
| 132 |
+
report=report,
|
| 133 |
+
docs_retrieved=3,
|
| 134 |
+
min_retrieval_score=0.72,
|
| 135 |
+
)
|
| 136 |
+
status = "PASS" if report.overall else "FAIL"
|
| 137 |
+
faith = next(r for r in report.results if r.metric == "faithfulness")
|
| 138 |
+
print(f" [{tag}] {pair['id']:<20} {status} faith={faith.score:.3f} {faith.detail}")
|
| 139 |
+
return report
|
| 140 |
+
|
| 141 |
+
|
| 142 |
+
def run() -> None:
|
| 143 |
+
pairs = yaml.safe_load(DATASET_PATH.read_text())["pairs"]
|
| 144 |
+
kb: dict[str, str] = {}
|
| 145 |
+
|
| 146 |
+
# ── Batch 1: clean traffic ──────────────────────────────────────────────
|
| 147 |
+
print("\n── Batch 1: clean traffic (expected answers) ──\n")
|
| 148 |
+
for pair in pairs:
|
| 149 |
+
domain = pair["domain"]
|
| 150 |
+
if domain not in kb:
|
| 151 |
+
kb[domain] = _load_kb_context(domain)
|
| 152 |
+
response = pair["expected_answer"].strip()
|
| 153 |
+
_record(pair, response, kb[domain], "clean")
|
| 154 |
+
time.sleep(0.05)
|
| 155 |
+
|
| 156 |
+
# ── Batch 2: dirty traffic (hallucinated responses) ─────────────────────
|
| 157 |
+
print("\n── Batch 2: dirty traffic (hallucinated responses) ──\n")
|
| 158 |
+
dirty_pairs = [p for p in pairs if p["id"] in HALLUCINATED]
|
| 159 |
+
for pair in dirty_pairs:
|
| 160 |
+
domain = pair["domain"]
|
| 161 |
+
response = HALLUCINATED[pair["id"]]
|
| 162 |
+
_record(pair, response, kb[domain], "dirty")
|
| 163 |
+
time.sleep(0.05)
|
| 164 |
+
|
| 165 |
+
total = telemetry.live_stats()["total_queries"]
|
| 166 |
+
print(f"\nTelemetry buffer: {total} events ({len(pairs)} clean + {len(dirty_pairs)} dirty)\n")
|
| 167 |
+
|
| 168 |
+
# ── Drift detection ─────────────────────────────────────────────────────
|
| 169 |
+
print("=" * 60)
|
| 170 |
+
print("Running drift detection vs golden-dataset baseline...")
|
| 171 |
+
print("=" * 60)
|
| 172 |
+
|
| 173 |
+
sys.path.insert(0, str(Path(__file__).parent))
|
| 174 |
+
from drift import build_current, build_reference, detect_drift, report_drift
|
| 175 |
+
|
| 176 |
+
print("\nBuilding reference distribution...")
|
| 177 |
+
reference = build_reference()
|
| 178 |
+
|
| 179 |
+
current = build_current()
|
| 180 |
+
cur_n = len(next(iter(current.values()), []))
|
| 181 |
+
print(f"Reference: {len(next(iter(reference.values())))} pairs")
|
| 182 |
+
print(f"Current: {cur_n} events\n")
|
| 183 |
+
|
| 184 |
+
results = detect_drift(current, reference)
|
| 185 |
+
report_drift(results)
|
| 186 |
+
print()
|
| 187 |
+
|
| 188 |
+
|
| 189 |
+
if __name__ == "__main__":
|
| 190 |
+
run()
|
|
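The drift step at the end of the simulation reads the recorded events back out of the telemetry buffer and turns them into one score column per metric. A minimal sketch of that column-building step follows, assuming events are plain dicts with a `metrics` mapping (the real `telemetry` module is not shown in this diff); `collect_scores` and the shortened metric list are hypothetical.

```python
from typing import Any

METRICS = ["faithfulness", "answer_relevancy"]  # shortened list, for illustration only


def collect_scores(
    events: list[dict[str, Any]], metrics: list[str]
) -> dict[str, list[float]]:
    """Build one score column per metric, skipping events that lack any metric."""
    columns: dict[str, list[float]] = {m: [] for m in metrics}
    for event in events:
        recorded = event.get("metrics")
        if not recorded or any(recorded.get(m) is None for m in metrics):
            continue  # all-or-nothing per event keeps the columns the same length
        for m in metrics:
            columns[m].append(float(recorded[m]))
    return columns


events = [
    {"metrics": {"faithfulness": 1.0, "answer_relevancy": 0.8}},  # clean response
    {"metrics": {"faithfulness": 0.0, "answer_relevancy": 0.7}},  # hallucinated response
    {"metrics": {"faithfulness": 0.5}},                           # incomplete, skipped
    {"latency_ms": 12.0},                                         # no metrics, skipped
]
columns = collect_scores(events, METRICS)
print(columns)
```

Dropping incomplete events wholesale means every metric is tested on the same set of requests, at the cost of discarding partial telemetry.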
@@ -0,0 +1,116 @@
|
| 1 |
+
"""
|
| 2 |
+
Unit tests for drift detection: detect_drift() only.
|
| 3 |
+
No model loading, no IO, no telemetry.
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import sys
|
| 7 |
+
from pathlib import Path
|
| 8 |
+
|
| 9 |
+
import numpy as np
|
| 10 |
+
import pytest
|
| 11 |
+
|
| 12 |
+
sys.path.insert(0, str(Path(__file__).parent.parent.parent / "eval"))
|
| 13 |
+
|
| 14 |
+
from drift import ALPHA, MIN_CURRENT_SAMPLES, MetricDrift, detect_drift
|
| 15 |
+
|
| 16 |
+
METRICS = ["faithfulness", "answer_relevancy", "pii_leakage", "token_budget", "chain_terminology"]
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def _scores(n: int, **col_values: list[float]) -> dict[str, list[float]]:
|
| 20 |
+
"""Build a Scores dict; columns passed by keyword are used as given, all others default to n copies of 0.9."""
|
| 21 |
+
data: dict[str, list[float]] = {}
|
| 22 |
+
for metric in METRICS:
|
| 23 |
+
data[metric] = col_values.get(metric, [0.9] * n)
|
| 24 |
+
return data
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
class TestDetectDrift:
|
| 28 |
+
def test_identical_distributions_no_drift(self) -> None:
|
| 29 |
+
rng = np.random.default_rng(42)
|
| 30 |
+
scores = rng.uniform(0.5, 1.0, 50).tolist()
|
| 31 |
+
ref = _scores(50, faithfulness=scores)
|
| 32 |
+
cur = _scores(50, faithfulness=scores)
|
| 33 |
+
results = detect_drift(cur, ref)
|
| 34 |
+
faith = next(r for r in results if r.metric == "faithfulness")
|
| 35 |
+
assert faith.drifted is False
|
| 36 |
+
|
| 37 |
+
def test_shifted_distribution_detected(self) -> None:
|
| 38 |
+
ref = _scores(50, faithfulness=[0.9] * 50)
|
| 39 |
+
cur = _scores(50, faithfulness=[0.1] * 50)
|
| 40 |
+
results = detect_drift(cur, ref)
|
| 41 |
+
faith = next(r for r in results if r.metric == "faithfulness")
|
| 42 |
+
assert faith.drifted is True
|
| 43 |
+
assert faith.p_value < ALPHA
|
| 44 |
+
|
| 45 |
+
def test_below_min_samples_excluded(self) -> None:
|
| 46 |
+
ref = _scores(50)
|
| 47 |
+
cur = _scores(MIN_CURRENT_SAMPLES - 1)
|
| 48 |
+
results = detect_drift(cur, ref)
|
| 49 |
+
assert results == []
|
| 50 |
+
|
| 51 |
+
def test_exactly_min_samples_included(self) -> None:
|
| 52 |
+
ref = _scores(50)
|
| 53 |
+
cur = _scores(MIN_CURRENT_SAMPLES)
|
| 54 |
+
results = detect_drift(cur, ref)
|
| 55 |
+
assert len(results) == len(METRICS)
|
| 56 |
+
|
| 57 |
+
def test_ks_statistic_in_range(self) -> None:
|
| 58 |
+
ref = _scores(50, faithfulness=[0.9] * 50)
|
| 59 |
+
cur = _scores(50, faithfulness=[0.1] * 50)
|
| 60 |
+
results = detect_drift(cur, ref)
|
| 61 |
+
faith = next(r for r in results if r.metric == "faithfulness")
|
| 62 |
+
assert 0.0 <= faith.ks_statistic <= 1.0
|
| 63 |
+
|
| 64 |
+
def test_means_computed_correctly(self) -> None:
|
| 65 |
+
ref = _scores(10, faithfulness=[0.8] * 10)
|
| 66 |
+
cur = _scores(10, faithfulness=[0.4] * 10)
|
| 67 |
+
results = detect_drift(cur, ref)
|
| 68 |
+
faith = next(r for r in results if r.metric == "faithfulness")
|
| 69 |
+
assert faith.ref_mean == pytest.approx(0.8, abs=1e-3)
|
| 70 |
+
assert faith.cur_mean == pytest.approx(0.4, abs=1e-3)
|
| 71 |
+
|
| 72 |
+
def test_all_metrics_returned(self) -> None:
|
| 73 |
+
ref = _scores(30)
|
| 74 |
+
cur = _scores(30)
|
| 75 |
+
result_names = {r.metric for r in detect_drift(cur, ref)}
|
| 76 |
+
assert result_names == set(METRICS)
|
| 77 |
+
|
| 78 |
+
def test_result_is_metric_drift_dataclass(self) -> None:
|
| 79 |
+
ref = _scores(20)
|
| 80 |
+
cur = _scores(20)
|
| 81 |
+
for r in detect_drift(cur, ref):
|
| 82 |
+
assert isinstance(r, MetricDrift)
|
| 83 |
+
assert isinstance(r.drifted, bool)
|
| 84 |
+
assert isinstance(r.ks_statistic, float)
|
| 85 |
+
assert isinstance(r.p_value, float)
|
| 86 |
+
|
| 87 |
+
def test_custom_alpha_respected(self) -> None:
|
| 88 |
+
rng = np.random.default_rng(0)
|
| 89 |
+
ref = _scores(50, faithfulness=rng.uniform(0.7, 1.0, 50).tolist())
|
| 90 |
+
cur = _scores(50, faithfulness=rng.uniform(0.4, 0.7, 50).tolist())
|
| 91 |
+
strict = detect_drift(cur, ref, alpha=0.001)
|
| 92 |
+
lenient = detect_drift(cur, ref, alpha=0.999)
|
| 93 |
+
faith_strict = next(r for r in strict if r.metric == "faithfulness")
|
| 94 |
+
faith_lenient = next(r for r in lenient if r.metric == "faithfulness")
|
| 95 |
+
assert faith_lenient.drifted or not faith_strict.drifted
|
| 96 |
+
|
| 97 |
+
def test_missing_metric_column_skipped(self) -> None:
|
| 98 |
+
ref: dict[str, list[float]] = {"faithfulness": [0.9] * 20}
|
| 99 |
+
cur: dict[str, list[float]] = {"faithfulness": [0.4] * 20}
|
| 100 |
+
results = detect_drift(cur, ref)
|
| 101 |
+
assert all(r.metric == "faithfulness" for r in results)
|
| 102 |
+
assert len(results) == 1
|
| 103 |
+
|
| 104 |
+
def test_empty_reference_skipped(self) -> None:
|
| 105 |
+
ref: dict[str, list[float]] = {"faithfulness": []}
|
| 106 |
+
cur: dict[str, list[float]] = {"faithfulness": [0.4] * 20}
|
| 107 |
+
results = detect_drift(cur, ref)
|
| 108 |
+
assert results == []
|
| 109 |
+
|
| 110 |
+
def test_sample_counts_in_result(self) -> None:
|
| 111 |
+
ref = _scores(30)
|
| 112 |
+
cur = _scores(10)
|
| 113 |
+
results = detect_drift(cur, ref)
|
| 114 |
+
for r in results:
|
| 115 |
+
assert r.ref_n == 30
|
| 116 |
+
assert r.cur_n == 10
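
For readers who want to see what the `ks_statistic` field in these tests measures, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic in pure Python. It is not the code in `eval/drift.py`, which is not shown here and may well delegate to a statistics library. The function name `ks_statistic_sketch` is hypothetical, and the sketch computes only the statistic, not the p-value that `detect_drift()` also reports.

```python
from bisect import bisect_right


def ks_statistic_sketch(current: list[float], reference: list[float]) -> float:
    """Largest absolute gap between the two empirical CDFs (hypothetical helper)."""
    if not current or not reference:
        raise ValueError("both samples must be non-empty")
    cur = sorted(current)
    ref = sorted(reference)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at sample values,
    # so it is enough to compare them at every observed value.
    for x in set(cur) | set(ref):
        cdf_cur = bisect_right(cur, x) / len(cur)
        cdf_ref = bisect_right(ref, x) / len(ref)
        largest_gap = max(largest_gap, abs(cdf_cur - cdf_ref))
    return largest_gap


# Same shapes as test_shifted_distribution_detected and the identical-sample case.
print(ks_statistic_sketch([0.1] * 50, [0.9] * 50))  # 1.0
print(ks_statistic_sketch([0.5, 0.7], [0.5, 0.7]))  # 0.0
```

The statistic always falls in [0, 1], which is the property `test_ks_statistic_in_range` checks: fully separated samples give 1.0 and identical samples give 0.0.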
@@ -11,10 +11,17 @@ import pytest
 
 sys.path.insert(0, str(Path(__file__).parent.parent.parent / "backend"))
 
+from unittest.mock import MagicMock, patch
+
+import numpy as np
+
 from grader import (
     grade_pii_leakage,
     grade_token_budget,
     grade_chain_terminology,
+    decompose_claims,
+    grade_faithfulness_decomposed,
+    FAITHFULNESS_THRESHOLD,
     TOKEN_BUDGET,
 )
 
@@ -138,3 +145,111 @@ class TestChainTerminology:
         )
         assert result.passed is False
         assert any(v["expected"] == "formulary pre-approval" for v in result.metadata["violations"])
+
+
+# ── decompose_claims ──────────────────────────────────────────────────────────
+
+class TestDecomposeClaims:
+    def test_single_sentence(self) -> None:
+        claims = decompose_claims("The product is in stock.")
+        assert claims == ["The product is in stock."]
+
+    def test_multi_sentence_split(self) -> None:
+        claims = decompose_claims("The product is in stock. It costs five dollars. Delivery takes two days.")
+        assert len(claims) == 3
+
+    def test_fragments_under_three_words_excluded(self) -> None:
+        claims = decompose_claims("Yes. The product is available in all sizes.")
+        assert all(len(c.split()) >= 3 for c in claims)
+
+    def test_exclamation_and_question_split(self) -> None:
+        claims = decompose_claims("Stock is low! Would you like to reorder? The threshold is five units.")
+        assert len(claims) == 3
+
+    def test_empty_string_returns_empty(self) -> None:
+        assert decompose_claims("") == []
+
+
+# ── grade_faithfulness_decomposed ────────────────────────────────────────────
+
+def _make_nli(entailment: float) -> MagicMock:
+    """Mock CrossEncoder whose predict() always returns the given entailment score."""
+    mock = MagicMock()
+    # columns: [contradiction, entailment, neutral]
+    mock.predict = MagicMock(
+        side_effect=lambda pairs, **kw: np.array([[0.1, entailment, 0.0]] * len(pairs))
+    )
+    return mock
+
+
+CONTEXT = "The product costs five dollars.\n\nDelivery takes two days."
+
+
+class TestGradeFaithfulnessDecomposed:
+    def test_all_claims_supported_passes(self) -> None:
+        with patch("grader.get_nli_model", return_value=_make_nli(0.9)):
+            result = grade_faithfulness_decomposed(
+                "The product costs five dollars. Delivery takes two days.", CONTEXT
+            )
+        assert result.passed is True
+        assert result.score == 1.0
+        assert result.metadata["claims"][0]["supported"] is True
+
+    def test_all_claims_unsupported_fails(self) -> None:
+        with patch("grader.get_nli_model", return_value=_make_nli(0.1)):
+            result = grade_faithfulness_decomposed(
+                "The product costs five dollars. Delivery takes two days.", CONTEXT
+            )
+        assert result.passed is False
+        assert result.score == 0.0
+
+    def test_partial_hallucination_detected(self) -> None:
+        # first claim supported, second not; whole-response NLI would miss this
+        call_count = 0
+
+        def side_effect(pairs: list, **kw: object) -> np.ndarray:
+            nonlocal call_count
+            call_count += 1
+            entailment = 0.9 if call_count == 1 else 0.1
+            return np.array([[0.1, entailment, 0.0]] * len(pairs))
+
+        mock_model = MagicMock()
+        mock_model.predict = MagicMock(side_effect=side_effect)
+        with patch("grader.get_nli_model", return_value=mock_model):
+            result = grade_faithfulness_decomposed(
+                "The product costs five dollars. It was invented in 1842.", CONTEXT
+            )
+        assert result.score == 0.5
+        assert result.metadata["claims"][0]["supported"] is True
+        assert result.metadata["claims"][1]["supported"] is False
+
+    def test_refusal_auto_passes(self) -> None:
+        result = grade_faithfulness_decomposed(
+            "I don't have enough information to answer that.", CONTEXT
+        )
+        assert result.passed is True
+        assert result.score == 1.0
+
+    def test_empty_context_fails(self) -> None:
+        with patch("grader.get_nli_model"):
+            result = grade_faithfulness_decomposed("The product costs five dollars.", "")
+        assert result.passed is False
+        assert result.score == 0.0
+
+    def test_metadata_shape(self) -> None:
+        with patch("grader.get_nli_model", return_value=_make_nli(0.8)):
+            result = grade_faithfulness_decomposed(
+                "The product is available. It ships in two days.", CONTEXT
+            )
+        for entry in result.metadata["claims"]:
+            assert "claim" in entry
+            assert "score" in entry
+            assert "supported" in entry
+
+    def test_score_is_proportion_not_max(self) -> None:
+        """Verify score = supported/total, not max(entailment_scores)."""
+        with patch("grader.get_nli_model", return_value=_make_nli(0.9)):
+            result = grade_faithfulness_decomposed(
+                "Claim one is true. Claim two is also true. Claim three too.", CONTEXT
+            )
+        assert result.score == 1.0
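
The behaviour these tests pin down (split a response into sentence-level claims, drop fragments under three words, score as supported/total) can be illustrated with a small standalone sketch. It is not the code in `backend/grader.py`: the names `split_into_claims` and `proportion_supported` are hypothetical, the regex split is only one plausible way to satisfy the tests, and the 0.5 cut-off is a stand-in because the real `FAITHFULNESS_THRESHOLD` value is not shown in this diff.

```python
import re

MIN_WORDS = 3            # assumption, inferred from test_fragments_under_three_words_excluded
SUPPORT_THRESHOLD = 0.5  # assumption; the real FAITHFULNESS_THRESHOLD value is not shown


def split_into_claims(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?', then drop short fragments."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) >= MIN_WORDS]


def proportion_supported(
    entailment_scores: list[float], threshold: float = SUPPORT_THRESHOLD
) -> float:
    """Fraction of claims whose entailment score reaches the threshold."""
    if not entailment_scores:
        return 0.0  # assumption: no claims means nothing was verified
    supported = sum(score >= threshold for score in entailment_scores)
    return supported / len(entailment_scores)


print(split_into_claims("Yes. The product is available in all sizes."))
# ['The product is available in all sizes.']
print(proportion_supported([0.9, 0.1]))  # 0.5
```

Scoring by proportion rather than by the maximum entailment score is what lets one hallucinated sentence lower the result: with scores `[0.9, 0.1]` the maximum is 0.9, while the proportion is 0.5, matching `test_partial_hallucination_detected`.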