---
title: AI Response Validator
emoji: π
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---
# AI Response Validator
Domain-agnostic RAG evaluation system. Validates AI responses for correctness, faithfulness, and client-specific terminology across retail and pharma domains.
Live demo: select a domain and client, ask a question; each response is evaluated in real time across 5 metrics. → [Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)
## Setup (5 minutes)

Requirements: Python 3.11+ and `HF_TOKEN` set in the environment (requires a HuggingFace account; the free tier is sufficient).

```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```
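To confirm the token is actually visible before starting the app, a quick sanity check with `huggingface_hub` (assuming it is installed by `make install`, which is not guaranteed here):

```python
# Sanity check: whoami() resolves the token from HF_TOKEN (or the local
# HF cache) and raises if it is missing or invalid.
from huggingface_hub import whoami

print(whoami()["name"])  # prints your HF username when the token works
```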
## Running the app

```bash
make run   # starts the API at http://localhost:8000
```

Open http://localhost:8000 in a browser; the UI loads automatically.
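To script against the server instead of using the browser, a minimal liveness check (root path only; no specific API routes are documented here, so none are assumed):

```python
# Ping the local server started by `make run`.
import requests

resp = requests.get("http://localhost:8000", timeout=5)
print(resp.status_code)  # expect 200 once the server is up
```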
## Tests

```bash
make test               # unit tests only (no server, no API key needed)
make test-integration   # integration tests (requires make run in another terminal)
```

Unit tests cover the graders, terminology logic, and client error handling. Integration tests hit the live API and verify end-to-end behavior. All tests are stateless; no cleanup is required.
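For a sense of the stateless unit-test style, a hypothetical sketch runnable under pytest (`normalize_term`, `CLIENT_TERMS`, and the aisle/lane mapping are illustrative names, not the repo's actual API):

```python
# Hypothetical terminology test in the stateless style described above.
# CLIENT_TERMS and normalize_term are illustrative, not the repo's API.
CLIENT_TERMS = {("NovaMart", "aisle"): "lane"}  # (client, canonical key) -> client term

def normalize_term(client: str, canonical: str) -> str:
    return CLIENT_TERMS.get((client, canonical), canonical)

def test_client_specific_term_wins():
    assert normalize_term("NovaMart", "aisle") == "lane"

def test_unknown_mapping_falls_back_to_canonical():
    assert normalize_term("ShelfWise", "aisle") == "aisle"
```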
## Batch evaluation (L2)

```bash
make eval-retail   # evaluate 10 retail Q&A pairs, open HTML report
make eval-pharma   # evaluate 10 pharma Q&A pairs, open HTML report
make eval          # all 20 pairs
```

Reports are written to `eval/reports/`.
## Code quality

```bash
make lint         # ruff (zero warnings expected)
make type-check   # mypy strict on client/
```
## Make targets

| Command | What it does |
|---|---|
| `make install` | pip install all dependencies |
| `make run` | Start the API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on client/ |
| `make eval-retail` | L2 batch eval – retail domain + HTML report |
| `make eval-pharma` | L2 batch eval – pharma domain + HTML report |
| `make eval` | L2 batch eval – all domains + HTML report |
## Eval results (`make eval`)

Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases). Results below are from a representative run; rerun with `make eval` after knowledge base updates.
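For orientation, the shape of one golden pair might look like this (every field name is an assumption; see the eval data in the repo for the real schema):

```python
# Illustrative golden Q&A pair; the field names are assumptions.
golden_pair = {
    "domain": "retail",
    "client": "NovaMart",
    "question": "Where do I find gluten-free pasta?",
    "expected_keyphrases": ["gluten-free", "aisle"],
    "adversarial": False,  # 4 of the 20 pairs are adversarial edge cases
}
```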
### L1 live metrics (pass rate across 20 pairs)

| Metric | Pass rate | Notes |
|---|---|---|
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded the 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below the 0.45 threshold (see the sketch after this table) |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used the canonical key instead of the client-specific term |
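The relevancy check pairs a bi-encoder with cosine similarity (see the metrics table below). A minimal sketch of that scoring step against the 0.45 threshold, assuming a generic sentence-transformers model (the model name is a placeholder, not necessarily what the app ships with):

```python
# Bi-encoder cosine similarity against the 0.45 relevancy threshold.
# The model choice is an assumption made for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
question = "Where is the pharmacy?"
answer = "The pharmacy is at the back of the store, next to aisle 12."
score = util.cos_sim(model.encode(question), model.encode(answer)).item()
print(f"{score:.2f}", "PASS" if score >= 0.45 else "FAIL")
```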
### L2 keyphrase coverage (batch, retail domain)
| Client | Pairs | Avg coverage |
|---|---|---|
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |
To update these numbers, run `make eval` (the server must be running).
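Coverage here is the fraction of expected keyphrases found in the answer. A minimal sketch, assuming case-insensitive substring matching (the repo may match more strictly):

```python
# Fraction of expected keyphrases present in the answer.
# Case-insensitive substring matching is an assumption.
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    text = answer.lower()
    hits = sum(1 for kp in expected if kp.lower() in text)
    return hits / len(expected) if expected else 1.0

print(keyphrase_coverage("Gluten-free pasta is in aisle 7.", ["gluten-free", "aisle"]))  # 1.0
```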
## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers, and deliberate tradeoffs.
See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency.
## Evaluation metrics

| Metric | Layer | Method |
|---|---|---|
| PII Leakage | L1 live | Regex scan → binary |
| Token Budget | L1 live | Char count ÷ 4 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
Core principle: no single metric proves correctness. The combination does.
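Three of these metrics are fully deterministic, which is what keeps them cheap enough to run live. A sketch of their logic, with the regex patterns, the 512-token budget, and the terminology map all treated as assumptions drawn from the notes above rather than the repo's code:

```python
# Deterministic L1 checks, sketched. Patterns, budget, and term map are
# illustrative assumptions, not the repo's actual values.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # SSN-like pattern
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email address
]

def pii_leakage(text: str) -> bool:
    """Binary: any pattern hit counts as leakage."""
    return any(p.search(text) for p in PII_PATTERNS)

def within_token_budget(text: str, budget: int = 512) -> bool:
    """Token estimate via char count ÷ 4, compared to the budget."""
    return len(text) / 4 <= budget

ROSETTA = {("NovaMart", "aisle"): "lane"}  # (client, canonical key) -> term

def chain_term(client: str, canonical: str) -> str:
    """Deterministic RosettaStone-style lookup with canonical fallback."""
    return ROSETTA.get((client, canonical), canonical)
```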