---
title: AI Response Validator
emoji: 🔍
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# AI Response Validator

Domain-agnostic RAG evaluation system. Validates AI responses for correctness, faithfulness, and client-specific terminology across retail and pharma domains.

**Live demo:** select a domain and client, ask a question, and each response is evaluated in real time across 5 metrics. → [Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)


## Setup (5 minutes)

Requirements: Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).

```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```

### Running the app

```bash
make run        # starts API at http://localhost:8000
```

Open http://localhost:8000 in a browser; the UI loads automatically.
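
You can also hit the API the UI uses directly. A minimal sketch with `requests` — the endpoint path and payload fields below are assumptions for illustration, not the server's documented contract:

```python
# Hypothetical request shape -- the endpoint path and payload fields are
# assumptions; inspect the server's routes for the real contract.
import requests

resp = requests.post(
    "http://localhost:8000/validate",  # hypothetical endpoint name
    json={
        "domain": "retail",            # domains in this README: retail, pharma
        "client": "NovaMart",          # client names from the eval tables
        "question": "What is the return policy?",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expect per-metric results for the five L1 checks
```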


## Tests

```bash
make test                 # unit tests only: no server, no API key needed
make test-integration     # integration tests: requires make run in another terminal
```

Unit tests cover graders, terminology logic, and client error handling. Integration tests hit the live API and verify end-to-end behavior. All tests are stateless; no cleanup is required.
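
As a sketch of what "stateless" means here, a grader-style unit test needs no fixtures or teardown; the helper below is an illustrative placeholder, not the repo's actual grader API:

```python
# Illustrative stateless grader test -- pure function in, assertion out,
# nothing to clean up. The helper is a placeholder, not the real grader.
def estimate_tokens(text: str) -> int:
    """Same char-count / 4 heuristic the live token_budget metric uses."""
    return len(text) // 4

def test_verbose_answer_exceeds_budget():
    verbose = "word " * 600          # 3000 chars, ~750 estimated tokens
    assert estimate_tokens(verbose) > 512

def test_short_answer_fits_budget():
    assert estimate_tokens("Short answer.") <= 512
```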


## Batch evaluation (L2)

```bash
make eval-retail      # evaluate 10 retail Q&A pairs, open HTML report
make eval-pharma      # evaluate 10 pharma Q&A pairs, open HTML report
make eval             # all 20 pairs
```

Reports are written to `eval/reports/`.


## Code quality

```bash
make lint             # ruff: zero warnings expected
make type-check       # mypy strict on client/
```

## Make targets

| Command | What it does |
| --- | --- |
| `make install` | `pip install` all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on `client/` |
| `make eval-retail` | L2 batch eval: retail domain + HTML report |
| `make eval-pharma` | L2 batch eval: pharma domain + HTML report |
| `make eval` | L2 batch eval: all domains + HTML report |

## Eval results (`make eval`)

Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases). The numbers below are from a representative run; rerun with `make eval` after knowledge base updates.

### L1 live metrics (pass rate across 20 pairs)

| Metric | Pass rate | Notes |
| --- | --- | --- |
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded the 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below the 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used the canonical key instead of the client-specific term |

### L2 keyphrase coverage (batch, retail domain)

| Client | Pairs | Avg coverage |
| --- | --- | --- |
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |

To update these numbers: `make eval` (server must be running).


## Architecture

See ARCHITECTURE.md for system design, evaluation layers, and deliberate tradeoffs.

See NOTES.md for design decisions, what's next, and LLM transparency.


## Evaluation metrics

| Metric | Layer | Method |
| --- | --- | --- |
| PII Leakage | L1 live | Regex scan, binary |
| Token Budget | L1 live | Char count ÷ 4 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
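
For intuition, a minimal sketch of four of these checks in isolation. The PII regex, embedding model, and pass thresholds are assumptions for illustration (the 512-token budget and 0.45 relevancy threshold come from the eval notes above), not the project's actual implementation; faithfulness and terminology are omitted because they depend on model and client-data specifics:

```python
# Illustrative metric implementations -- regex, model name, and thresholds
# are assumptions, not the project's actual code.
import re
from sentence_transformers import SentenceTransformer, util

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.\w+")  # one of several PII patterns

def pii_leakage(answer: str) -> bool:
    """L1: binary regex scan -- True means a PII pattern leaked."""
    return bool(EMAIL_RE.search(answer))

def token_budget_ok(answer: str, budget: int = 512) -> bool:
    """L1: rough token estimate -- character count divided by 4."""
    return len(answer) / 4 <= budget

_bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def answer_relevancy_ok(question: str, answer: str,
                        threshold: float = 0.45) -> bool:
    """L1: cosine similarity between question and answer embeddings."""
    q_emb, a_emb = _bi_encoder.encode([question, answer])
    return float(util.cos_sim(q_emb, a_emb)) >= threshold

def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """L2: fraction of expected keyphrases found verbatim in the answer."""
    hits = sum(p.lower() in answer.lower() for p in expected)
    return hits / len(expected) if expected else 1.0
```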

**Core principle:** no single metric proves correctness. The combination does.