# Architecture — AI Response Validator

## What this is

A domain-agnostic RAG evaluation system that validates AI responses for correctness,
faithfulness, and client-specific terminology. Built as a portfolio demonstration of
eval-driven architecture applied to production AI systems.

**Core claim:** no single metric proves correctness. The combination does.

---

## System overview

```
USER QUERY + CLIENT SELECTION
           │
           ▼
    ┌─────────────┐
    │   FastAPI   │  /query endpoint
    └──────┬──────┘
           │
           ▼
    ┌─────────────────────────────────────┐
    │           pipeline.run()            │
    │                                     │
    │  1. retrieve()                      │
    │     cosine search over KB index     │
    │     sentence-transformers embed     │
    │     top-3 docs returned             │
    │                                     │
    │  2. _generate()                     │
    │     context injected into prompt    │
    │     Llama 3 (HF) generates answer   │
    │                                     │
    │  3. grade()                         │
    │     5 L1 metrics run in sequence    │
    └─────────────────────────────────────┘
           │
           ▼
    ┌─────────────────────────────────────┐
    │         GradeReport.summary()       │
    │  {overall_pass, metrics: {…}}       │
    └─────────────────────────────────────┘
           │
           ▼
      JSON response → UI eval panel
```

---

## Two-layer evaluation

### L1 — Live (every query, sub-second overhead)

Runs inline with every request. No ground truth required.

| Metric | Method | Threshold | Rationale |
|--------|--------|-----------|-----------|
| `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate — fails hard |
| `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
| `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
| `faithfulness` | Vectara HHEM v2 cross-encoder | ≥ 0.35 | Hallucination detection |
| `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
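
The two cheapest gates reduce to a few lines each. The sketch below is illustrative, not the shipped grader: the regex patterns, function names, and the 512-token default are assumptions based on the table above.

```python
import re

# Illustrative patterns -- the shipped grader's regexes may be stricter.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "card": re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"),
}

def pii_leakage(text: str) -> list[str]:
    """Return the PII categories found in `text`; an empty list passes."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

def token_budget(text: str, limit: int = 512) -> bool:
    """Approximate token count as len(text) / 4 and compare to the budget."""
    return len(text) / 4 <= limit
```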

### L2 — Batch (local, against golden dataset)

```bash
python eval/metrics.py --domain retail
python eval/metrics.py --client novamart --out results.json
```

Runs all 20 golden pairs through the full pipeline. Adds keyphrase coverage scoring
on top of L1 metrics to verify factual completeness against reference answers.
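
The keyphrase scorer can be approximated as below. The function name and matching rule (case-insensitive substring) are assumptions; the real scorer may match more fuzzily.

```python
def keyphrase_coverage(response: str, keyphrases: list[str]) -> float:
    """Fraction of a golden pair's reference keyphrases that appear in the
    response (case-insensitive substring match)."""
    if not keyphrases:
        return 1.0  # nothing required, trivially complete
    text = response.lower()
    hits = sum(1 for phrase in keyphrases if phrase.lower() in text)
    return hits / len(keyphrases)
```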

---

## Key design decisions

### Bi-encoder vs cross-encoder: where each is used

Two fundamentally different model architectures serve different roles in this system:

| | Bi-encoder | Cross-encoder |
|---|---|---|
| **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
| **Speed** | Fast — embeddings pre-computed at index build time | Slow — must re-encode every (query, doc) pair at inference |
| **Quality** | Good for retrieval: finds semantically similar docs | Better for re-ranking or NLI: captures fine-grained entailment |
| **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (Vectara HHEM v2) |

**Measured overhead (CPU, HF Spaces):**

| Step | Model | Typical latency |
|------|-------|----------------|
| Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10–15 ms |
| KB cosine search (1,346 docs) | numpy matrix multiply | ~2 ms |
| Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
| Faithfulness (3 chunk pairs) | cross-encoder (Vectara HHEM v2) | ~300–600 ms |
| Total grading overhead | — | ~350–650 ms |

**Why bi-encoder for retrieval:** query time is constant regardless of KB size because
document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
query latency — only index build time grows.
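
A minimal sketch of that retrieval path, assuming numpy and pre-computed embeddings (the function names are illustrative, not the module's actual API):

```python
import numpy as np

def build_index(doc_embeddings: np.ndarray) -> np.ndarray:
    """Normalise document embeddings once at startup; after this, cosine
    similarity at query time is a single matrix-vector multiply."""
    norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    return doc_embeddings / norms

def retrieve(query_embedding: np.ndarray, index: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the top-k docs by cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q  # O(n_docs * dim) regardless of which query arrives
    return np.argsort(scores)[::-1][:k].tolist()
```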

**Why cross-encoder for faithfulness:** cross-encoders see both the document and the
response simultaneously, capturing entailment relationships bi-encoders miss. A response
can be semantically similar to a document (high cosine) while still hallucinating specific
facts — the cross-encoder catches this; the bi-encoder does not.

### RosettaStone pattern

Each domain has a canonical term vocabulary (`STOCK_CHECK`, `DRUG_APPROVAL`, etc.).
Each client maps these to their own terminology. The bot must speak the client's
language, not the canonical internal names.

```
STOCK_CHECK → "availability scan"   (NovaMart)
STOCK_CHECK → "stock check"         (ShelfWise)
```

`check_terminology()` is deterministic — zero latency, no LLM, no false negatives.
It flags rival-client terms appearing without the correct client term.

**Why this matters:** in production multi-tenant AI systems, terminology leakage
between clients is a real failure mode. This catches it mechanically.
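
Mechanically, the check reduces to a dictionary scan. The catalog literal below is hypothetical (it mirrors the STOCK_CHECK example above), and `check_terminology`'s real signature may differ:

```python
# Hypothetical slice of a term catalog: canonical term -> client -> phrasing.
TERM_CATALOG = {
    "STOCK_CHECK": {
        "novamart": "availability scan",
        "shelfwise": "stock check",
    },
}

def check_terminology(response: str, client: str) -> list[str]:
    """Return canonical terms for which a rival client's phrasing appears
    without the correct client's phrasing. An empty list means 0 violations."""
    text = response.lower()
    violations = []
    for canonical, phrasings in TERM_CATALOG.items():
        correct = phrasings[client]
        rivals = (term for c, term in phrasings.items() if c != client)
        if any(term in text for term in rivals) and correct not in text:
            violations.append(canonical)
    return violations
```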

### Faithfulness via Vectara HHEM v2

The faithfulness grader uses [Vectara's Hallucination Evaluation Model](https://huggingface.co/vectara/hallucination_evaluation_model) —
a cross-encoder fine-tuned specifically for RAG faithfulness (not general NLI entailment).
It scores `(document_chunk, response)` pairs and returns a probability in [0, 1] that
the response is factually consistent with the document.

**Why not Claude-as-judge:** adds API cost and latency per query; non-deterministic;
requires prompt engineering to produce consistent scores. A purpose-built cross-encoder
is faster, cheaper, and more consistent for this specific task.

**Why not generic NLI (DeBERTa):** general NLI models are trained on textual entailment
benchmarks, not RAG faithfulness. They score whether a hypothesis follows logically from
a premise — a different task. Correct, grounded answers score near zero on NLI entailment,
causing false positives. HHEM v2 is trained on (document, response) pairs from real RAG
systems, which maps directly to this use case.

### In-memory semantic retrieval

KB documents are encoded once per domain at first query and cached in a module-level
dict. Cosine search at query time. No vector database, no persistent storage.

**Why no vector DB:** the KB is small (8-9 docs per domain). A vector DB would add
operational complexity with zero retrieval quality benefit at this scale.

**Tradeoff accepted:** KB updates require a process restart to invalidate the cache.
Acceptable for a demo; production would add a cache invalidation signal.
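
The caching pattern, sketched with the embedding model and KB loader stubbed out as parameters (the real module hard-codes both; all names here are illustrative):

```python
import numpy as np

_INDEX_CACHE: dict[str, np.ndarray] = {}  # domain -> normalised KB embeddings

def get_index(domain: str, encode, load_docs) -> np.ndarray:
    """Encode a domain's KB on first use and cache it at module level;
    subsequent queries for the same domain skip encoding entirely."""
    if domain not in _INDEX_CACHE:
        embeddings = np.asarray([encode(doc) for doc in load_docs(domain)])
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        _INDEX_CACHE[domain] = embeddings / norms
    return _INDEX_CACHE[domain]
```

Because the dict lives at module scope, the only way to drop a stale index is to restart the process.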

### Why two evaluation layers

L1 catches structural failures (PII, irrelevance, hallucination) instantly, on every
query. L2 catches factual gaps by comparing against reference answers — requires
ground truth, so it can't run live.

Running L2 on every query would add 30+ seconds of latency (LLM reverse-question
generation, per-chunk precision scoring). The two-layer split is a deliberate
latency/depth tradeoff.

---

## Repository structure

```
backend/
  app.py          FastAPI app — endpoints, lifespan, static file serving
  pipeline.py     Orchestrator — retrieve → generate → grade
  grader.py       L1 metric implementations + GradeReport dataclass
  rosetta.py      RosettaStone — canonical ↔ client term translation
  config.py       Domain/client registry, shared constants

client/
  client.py       ValidatorClient — typed HTTP client with retries and timeouts
  models.py       Pydantic request/response models
  exceptions.py   APIError, TimeoutError, RetryExhaustedError

tests/
  unit/           Behavioral tests — no network, no LLM (make test)
  integration/    End-to-end tests against live API (make test-integration)
  conftest.py     Shared fixtures and integration marker

knowledge/
  retail/
    term-catalog.yaml   Canonical → client term mappings (NovaMart, ShelfWise)
    features.yaml       KB documents for retrieval
  pharma/
    term-catalog.yaml   Canonical → client term mappings (ClinixOne, PharmaLink)
    features.yaml       KB documents for retrieval

eval/
  golden-dataset.yaml   20 Q&A pairs (10 retail, 10 pharma) for L2 evaluation
  metrics.py            L2 batch runner — CLI, keyphrase scoring, HTML report

ui/
  index.html    Chat interface + eval panel
  app.js        Domain/client switcher, message rendering, metric cards

---

## Deliberate tradeoffs

| Decision | Alternative | Why this |
|----------|-------------|----------|
| Vectara HHEM v2 for faithfulness | Claude-as-judge / DeBERTa NLI | Purpose-built for RAG faithfulness; no API cost; deterministic |
| In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
| Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
| Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
| Plain HTML/JS frontend | React/Next.js | No build step — deploys as static files |
| httpx + tenacity client | requests + urllib3 retry | Cleaner timeout API, native async path if needed |
| Pydantic v2 models | TypedDict / dataclasses | Validation at boundary, IDE autocomplete, mypy strict |

---

## Evaluation coverage vs RAGAS

| RAGAS metric | Coverage |
|---|---|
| faithfulness | ✓ L1 (Vectara HHEM v2) |
| answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) |
| context_precision | partial — retrieval score visible in UI |
| context_recall | ✓ L2 (keyphrase coverage) |
| answer_correctness | ✓ L2 (keyphrase + expected_answer) |