---
title: AI Response Validator
emoji: 🔍
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# AI Response Validator

Domain-agnostic RAG evaluation system. Validates AI responses for correctness,
faithfulness, and client-specific terminology across retail and pharma domains.

**Live demo:** select a domain and client, then ask a question; each response is evaluated
in real time across 5 metrics. [→ Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)

---

## Setup (5 minutes)

**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).

```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```

---

## Running the app

```bash
make run        # starts API at http://localhost:8000
```

Open `http://localhost:8000` in a browser; the UI loads automatically.

---

## Tests

```bash
make test                 # unit tests only: no server, no API key needed
make test-integration     # integration tests: requires make run in another terminal
```

Unit tests cover graders, terminology logic, and client error handling.
Integration tests hit the live API and verify end-to-end behavior.
All tests are stateless, so no cleanup is required.

---

## Batch evaluation (L2)

```bash
make eval-retail      # evaluate retail Q&A pairs, open HTML report
make eval-pharma      # evaluate pharma Q&A pairs, open HTML report
make eval             # all domains
```

Reports are written to `eval/reports/`.

**Drift detection** (no server required):

```bash
python eval/simulate_traffic.py   # populate telemetry + run drift report
python eval/drift.py              # drift report against live telemetry
```

Compares live grader score distributions against the golden-dataset baseline using KS tests.
Detects faithfulness degradation from model updates, KB staleness, or query distribution shift.
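
The KS comparison described above can be sketched in a few lines of standard-library Python. This is an illustration of the two-sample test, not the code in `eval/drift.py`, which may well use a library routine such as SciPy's `ks_2samp`. The function names, the score lists, and the `alpha = 0.05` level are assumptions made for the example.

```python
import math
from bisect import bisect_right


def ks_statistic(baseline: list[float], live: list[float]) -> float:
    """Largest gap between the two empirical CDFs (two-sample KS statistic D)."""
    a, b = sorted(baseline), sorted(live)
    d = 0.0
    for x in a + b:  # the gap can only peak at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def drifted(baseline: list[float], live: list[float], alpha: float = 0.05) -> bool:
    """Large-sample rejection rule: D > c(alpha) * sqrt((n + m) / (n * m))."""
    n, m = len(baseline), len(live)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return ks_statistic(baseline, live) > c_alpha * math.sqrt((n + m) / (n * m))


# Hypothetical faithfulness scores, for illustration only.
golden = [0.90, 0.85, 0.92, 0.88, 0.95, 0.91, 0.87, 0.93, 0.89, 0.94]
degraded = [0.60, 0.55, 0.70, 0.65, 0.58, 0.62, 0.68, 0.52, 0.66, 0.59]

print(drifted(golden, golden))    # identical samples, so D is 0
print(drifted(golden, degraded))  # fully separated samples, so D is 1
```

The critical-value rule is a large-sample approximation, so for samples this small an exact p-value from a statistics library is the safer choice.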

---

## Code quality

```bash
make lint             # ruff: zero warnings expected
make type-check       # mypy strict on client/
```

---

## Make targets

| Command | What it does |
|---------|-------------|
| `make install` | pip install all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on client/ |
| `make eval-retail` | L2 batch eval: retail domain + HTML report |
| `make eval-pharma` | L2 batch eval: pharma domain + HTML report |
| `make eval` | L2 batch eval: all domains + HTML report |

---

## Eval results (`make eval`)

Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases).
Results are from a representative run; rerun with `make eval` after knowledge base updates.

### L1 live metrics (pass rate across 20 pairs)

| Metric | Pass rate | Notes |
|--------|-----------|-------|
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used canonical key instead of client-specific term |
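
As a rough illustration of how the `token_budget` and `answer_relevancy` verdicts in this table could be reached, the sketch below applies the 512-token budget and the 0.45 threshold quoted in the Notes column, using the character-count heuristic listed under Evaluation metrics. The function names are hypothetical, the embedding vectors are assumed to come from a bi-encoder that is not shown, and treating the boundary values as passing is an assumption.

```python
import math

TOKEN_BUDGET = 512          # budget quoted in the table above
RELEVANCY_THRESHOLD = 0.45  # threshold quoted in the table above


def estimate_tokens(text: str) -> int:
    """Rough token estimate: character count divided by 4."""
    return len(text) // 4


def within_token_budget(text: str, budget: int = TOKEN_BUDGET) -> bool:
    """Pass when the estimate does not exceed the budget (assumed boundary)."""
    return estimate_tokens(text) <= budget


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_relevant(question_vec: list[float], answer_vec: list[float],
                threshold: float = RELEVANCY_THRESHOLD) -> bool:
    """Pass when the similarity reaches the threshold (assumed boundary)."""
    return cosine_similarity(question_vec, answer_vec) >= threshold


# Toy inputs, for illustration only.
print(within_token_budget("a" * 2048))       # 2048 chars, about 512 tokens
print(is_relevant([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors, similarity 0
```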

### L2 keyphrase coverage (batch, retail domain)

| Client | Pairs | Avg coverage |
|--------|-------|-------------|
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |

To update these numbers: `make eval` (server must be running).
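
A minimal sketch of a per-pair keyphrase coverage score, assuming coverage is the fraction of expected keyphrases that appear in the answer as case-insensitive substrings. The function name, the matching rule, and the sample data are assumptions; the project's grader may normalise or match differently.

```python
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected keyphrases found in the answer (case-insensitive)."""
    if not expected:
        return 1.0  # assumption: nothing expected counts as fully covered
    text = answer.lower()
    hits = sum(1 for phrase in expected if phrase.lower() in text)
    return hits / len(expected)


# Hypothetical answer and keyphrases, for illustration only.
answer = "Returns are accepted within 30 days with the original receipt."
expected = ["30 days", "original receipt", "store credit", "refund"]
print(keyphrase_coverage(answer, expected))  # 2 of 4 phrases matched -> 0.5
```

Averaging this score over a client's pairs would give a figure comparable to the "Avg coverage" column.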

---

## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,
and deliberate tradeoffs.

See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency.

---

## Evaluation metrics

| Metric | Layer | Method |
|--------|-------|--------|
| PII Leakage | L1 live | Regex scan (binary) |
| Token Budget | L1 live | Char count ÷ 4 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Claim decomposition + sentence-level NLI cross-encoder |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
| Drift Detection | L2 offline | KS two-sample test vs golden-dataset baseline |
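
To make the "Regex scan (binary)" method concrete, here is a small sketch of a pattern-based PII check. The three patterns and all names are illustrative assumptions, not the project's pattern set; a production scan would need a broader, audited list.

```python
import re

# Illustrative patterns only; real PII detection needs a broader, audited set.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def pii_hits(text: str) -> list[str]:
    """Names of the patterns that match anywhere in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def passes_pii_check(text: str) -> bool:
    """Binary verdict: pass only when no pattern matches."""
    return not pii_hits(text)


# Hypothetical responses, for illustration only.
print(passes_pii_check("Our store opens at 9 am."))
print(pii_hits("Contact jane.doe@example.com or 555-123-4567"))
```

Returning the matched pattern names alongside the binary verdict keeps a failure explainable without storing the sensitive text itself.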

**Core principle:** no single metric proves correctness. The combination does.