---
title: AI Response Validator
emoji: πŸ”
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---
# AI Response Validator
Domain-agnostic RAG evaluation system. Validates AI responses for correctness,
faithfulness, and client-specific terminology across retail and pharma domains.
**Live demo:** select a domain and client, ask a question β€” each response is evaluated
in real time across 5 metrics. [β†’ Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)
---
## Setup (5 minutes)
**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).
```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```
---
## Running the app
```bash
make run # starts API at http://localhost:8000
```
Open `http://localhost:8000` in a browser β€” the UI loads automatically.
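If you'd rather script against the API than use the UI, the sketch below shows the general shape of a request. The route name and payload fields here are hypothetical, for illustration only; check the running server for the actual endpoints and schema.

```python
# Minimal sketch of querying the API from Python.
# NOTE: the route name and payload fields are hypothetical --
# substitute the actual endpoint and schema the server exposes.
import requests

resp = requests.post(
    "http://localhost:8000/validate",  # hypothetical route
    json={
        "domain": "retail",  # assumed field names
        "client": "NovaMart",
        "question": "What is the return window for electronics?",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expect the answer plus per-metric evaluation results
```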
---
## Tests
```bash
make test # unit tests only β€” no server, no API key needed
make test-integration # integration tests β€” requires make run in another terminal
```
Unit tests cover graders, terminology logic, and client error handling.
Integration tests hit the live API and verify end-to-end behavior.
All tests are stateless β€” no cleanup required.
---
## Batch evaluation (L2)
```bash
make eval-retail # evaluate 10 retail Q&A pairs, open HTML report
make eval-pharma # evaluate 10 pharma Q&A pairs, open HTML report
make eval # all 20 pairs
```
Reports are written to `eval/reports/`.
---
## Code quality
```bash
make lint # ruff β€” zero warnings expected
make type-check # mypy strict on client/
```
---
## Make targets
| Command | What it does |
|---------|-------------|
| `make install` | pip install all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on client/ |
| `make eval-retail` | L2 batch eval β€” retail domain + HTML report |
| `make eval-pharma` | L2 batch eval β€” pharma domain + HTML report |
| `make eval` | L2 batch eval β€” all domains + HTML report |
---
## Eval results (`make eval`)
Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases).
The numbers below come from a representative run; rerun with `make eval` after knowledge base updates.
### L1 live metrics (pass rate across 20 pairs)
| Metric | Pass rate | Notes |
|--------|-----------|-------|
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used canonical key instead of client-specific term |
### L2 keyphrase coverage (batch, retail domain)
| Client | Pairs | Avg coverage (0–1) |
|--------|-------|-------------|
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |
To update these numbers: `make eval` (server must be running).
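Coverage is conceptually a matched-over-expected ratio. A simplified sketch (the real scorer may normalize or fuzzy-match phrases rather than do exact substring checks):

```python
# Simplified keyphrase coverage -- illustrative only; the project's
# scorer may normalize, stem, or fuzzy-match instead of substring checks.
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected keyphrases found in the answer (case-insensitive)."""
    answer_lower = answer.lower()
    matched = sum(1 for phrase in expected if phrase.lower() in answer_lower)
    return matched / len(expected) if expected else 0.0

# Example: 2 of 3 phrases present -> 0.67
print(round(keyphrase_coverage(
    "Returns are accepted within 30 days with a receipt.",
    ["30 days", "receipt", "store credit"],
), 2))
```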
---
## Architecture
See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,
and deliberate tradeoffs.
See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency.
---
## Evaluation metrics
| Metric | Layer | Method |
|--------|-------|--------|
| PII Leakage | L1 live | Regex scan β€” binary |
| Token Budget | L1 live | Char count ÷ 4 vs. 512-token budget |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
**Core principle:** no single metric proves correctness. The combination does.
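To make the table concrete, here is a heavily simplified sketch of three of the L1 graders. The regex set and embedding model are assumptions for illustration; the 512-token budget and 0.45 relevancy threshold come from the eval results above. See the grader code for the real implementations.

```python
# Heavily simplified L1 grader sketches -- illustrative only. The regex
# patterns and embedding model below are assumptions; the 512-token budget
# and 0.45 threshold match the eval results above.
import re
from sentence_transformers import SentenceTransformer, util

# PII leakage: binary regex scan (example patterns, not the full set)
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def pii_leakage(answer: str) -> bool:
    return any(p.search(answer) for p in PII_PATTERNS)

# Token budget: estimate tokens as character count / 4
def token_budget_ok(answer: str, budget: int = 512) -> bool:
    return len(answer) / 4 <= budget

# Answer relevancy: bi-encoder cosine similarity between question and answer
# (model choice here is an assumption)
_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def answer_relevant(question: str, answer: str, threshold: float = 0.45) -> bool:
    q_emb, a_emb = _model.encode([question, answer], convert_to_tensor=True)
    return util.cos_sim(q_emb, a_emb).item() >= threshold
```

Faithfulness is different in kind: Vectara's HHEM cross-encoder scores (premise, hypothesis) pairs jointly rather than comparing independent embeddings, which is why it sits in its own grader.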