---
title: AI Response Validator
emoji: π
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---
# AI Response Validator
Domain-agnostic RAG evaluation system. Validates AI responses for correctness,
faithfulness, and client-specific terminology across retail and pharma domains.
**Live demo:** select a domain and client, then ask a question. Each response is evaluated
in real time across 5 metrics. [Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)
---
## Setup (5 minutes)
**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).
```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```
---
## Running the app
```bash
make run # starts API at http://localhost:8000
```
Open `http://localhost:8000` in a browser; the UI loads automatically.
---
## Tests
```bash
make test              # unit tests only: no server, no API key needed
make test-integration  # integration tests: requires `make run` in another terminal
```
Unit tests cover graders, terminology logic, and client error handling.
Integration tests hit the live API and verify end-to-end behavior.
All tests are stateless, so no cleanup is required.
---
## Batch evaluation (L2)
```bash
make eval-retail # evaluate retail Q&A pairs, open HTML report
make eval-pharma # evaluate pharma Q&A pairs, open HTML report
make eval # all domains
```
Reports are written to `eval/reports/`.
**Drift detection** (no server required):
```bash
python eval/simulate_traffic.py # populate telemetry + run drift report
python eval/drift.py # drift report against live telemetry
```
Compares live grader score distributions against the golden-dataset baseline using KS tests.
Detects faithfulness degradation from model updates, KB staleness, or query distribution shift.
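
The comparison above can be illustrated with a minimal, standard-library sketch of a two-sample KS check. It is not the code in `eval/drift.py`; the function names, the `alpha` default, and the large-sample critical-value approximation are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def ks_statistic(baseline, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(live)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the rejection threshold for the statistic."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


def has_drifted(baseline, live, alpha=0.05):
    """Hypothetical helper: True when the live scores differ from the baseline."""
    threshold = ks_critical_value(len(baseline), len(live), alpha)
    return ks_statistic(baseline, live) > threshold


# Illustrative (made-up) faithfulness scores.
golden_scores = [0.80 + i * 0.001 for i in range(50)]
degraded_scores = [0.40 + i * 0.001 for i in range(50)]
print(has_drifted(golden_scores, golden_scores))    # identical samples
print(has_drifted(golden_scores, degraded_scores))  # clearly shifted samples
```

The approximation for the critical value is only reasonable for moderately large samples; with a handful of telemetry points an exact test would be the safer choice.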
---
## Code quality
```bash
make lint        # ruff: zero warnings expected
make type-check # mypy strict on client/
```
---
## Make targets
| Command | What it does |
|---------|-------------|
| `make install` | pip install all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on client/ |
| `make eval-retail` | L2 batch eval: retail domain + HTML report |
| `make eval-pharma` | L2 batch eval: pharma domain + HTML report |
| `make eval` | L2 batch eval: all domains + HTML report |
---
## Eval results (`make eval`)
Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases).
Results from a representative run. Rerun with `make eval` after knowledge base updates.
### L1 live metrics (pass rate across 20 pairs)
| Metric | Pass rate | Notes |
|--------|-----------|-------|
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used canonical key instead of client-specific term |
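
As a rough illustration of the `token_budget` check, the sketch below estimates tokens as character count divided by 4 (the method listed under Evaluation metrics) and compares the estimate with the 512-token budget. The function names and the choice to round down are assumptions, not the project's actual grader.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Approximate the token count from the character count (rounding down is an assumption)."""
    return len(text) // chars_per_token


def within_token_budget(text: str, budget: int = 512) -> bool:
    """Hypothetical pass/fail check: pass when the estimate does not exceed the budget."""
    return estimate_tokens(text) <= budget


short_answer = "Returns are accepted within 30 days."
verbose_answer = "x" * 2052  # made-up response of 2,052 characters

print(estimate_tokens(short_answer), within_token_budget(short_answer))
print(estimate_tokens(verbose_answer), within_token_budget(verbose_answer))
```

A character-based estimate avoids loading a tokenizer on the live path, at the cost of some error on text with unusual token density.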
### L2 keyphrase coverage (batch, retail domain)
| Client | Pairs | Avg coverage |
|--------|-------|-------------|
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |
To update these numbers: `make eval` (server must be running).
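
To make the coverage figures concrete, here is a minimal sketch of one way a keyphrase coverage score could be computed: the fraction of expected keyphrases that appear in the answer. The case-insensitive substring matching and the function name are assumptions; the project's matcher may normalize text differently.

```python
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected keyphrases found in the answer (case-insensitive substring match)."""
    if not expected:
        raise ValueError("expected keyphrases must not be empty")
    # Collapse whitespace so line breaks inside the answer do not hide a phrase.
    normalized = " ".join(answer.lower().split())
    hits = sum(1 for phrase in expected if " ".join(phrase.lower().split()) in normalized)
    return hits / len(expected)


# Made-up example pair, not taken from the golden dataset.
answer = "Returns are accepted within 30 days with a receipt."
expected = ["30 days", "receipt", "store credit", "refund"]
print(keyphrase_coverage(answer, expected))
```

Averaging this score over a client's pairs would give a per-client figure comparable in shape to the table above.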
---
## Architecture
See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,
and deliberate tradeoffs.
See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency.
---
## Evaluation metrics
| Metric | Layer | Method |
|--------|-------|--------|
| PII Leakage | L1 live | Regex scan (binary) |
| Token Budget | L1 live | Char count ÷ 4 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Claim decomposition + sentence-level NLI cross-encoder |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
| Drift Detection | L2 offline | KS two-sample test vs golden-dataset baseline |
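
The Answer Relevancy row can be sketched as a cosine similarity between a question embedding and an answer embedding, compared against the 0.45 threshold reported in the eval results. The vectors below are made up, and the function names are hypothetical; in the real system the embeddings would come from the bi-encoder, which is not shown here.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot / (norm_u * norm_v)


def is_relevant(question_vec: list[float], answer_vec: list[float],
                threshold: float = 0.45) -> bool:
    """Hypothetical pass/fail check: pass when similarity reaches the threshold."""
    return cosine_similarity(question_vec, answer_vec) >= threshold


# Toy 2-dimensional "embeddings" for illustration only.
question = [1.0, 0.0]
on_topic = [1.0, 1.0]
off_topic = [0.2, 1.0]
print(is_relevant(question, on_topic))
print(is_relevant(question, off_topic))
```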
**Core principle:** no single metric proves correctness. The combination does.