---
title: AI Response Validator
emoji: 🔍
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# AI Response Validator

Domain-agnostic RAG evaluation system. Validates AI responses for correctness, faithfulness, and client-specific terminology across retail and pharma domains.

**Live demo:** select a domain and client, ask a question, and each response is evaluated in real time across 5 metrics. → [Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)


## Setup (5 minutes)

Requirements: Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).

```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```

### Running the app

```bash
make run        # starts API at http://localhost:8000
```

Open http://localhost:8000 in a browser; the UI loads automatically.
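
You can also hit the API the UI uses directly. A minimal sketch with `requests` — the endpoint path and payload fields below are assumptions for illustration, not the server's documented contract:

```python
# Hypothetical request shape -- the endpoint path and payload fields are
# assumptions; inspect the server's routes for the real contract.
import requests

resp = requests.post(
    "http://localhost:8000/validate",  # hypothetical endpoint name
    json={
        "domain": "retail",            # domains in this README: retail, pharma
        "client": "NovaMart",          # client names from the eval tables
        "question": "What is the return policy?",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expect per-metric results for the five L1 checks
```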


## Tests

```bash
make test                 # unit tests only: no server, no API key needed
make test-integration     # integration tests: requires make run in another terminal
```

Unit tests cover graders, terminology logic, and client error handling. Integration tests hit the live API and verify end-to-end behavior. All tests are stateless; no cleanup is required.
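
As a sketch of what "stateless" means here, a grader-style unit test needs no fixtures or teardown; the helper below is an illustrative placeholder, not the repo's actual grader API:

```python
# Illustrative stateless grader test -- pure function in, assertion out,
# nothing to clean up. The helper is a placeholder, not the real grader.
def estimate_tokens(text: str) -> int:
    """Same char-count / 4 heuristic the live token_budget metric uses."""
    return len(text) // 4

def test_verbose_answer_exceeds_budget():
    verbose = "word " * 600          # 3000 chars, ~750 estimated tokens
    assert estimate_tokens(verbose) > 512

def test_short_answer_fits_budget():
    assert estimate_tokens("Short answer.") <= 512
```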


## Batch evaluation (L2)

```bash
make eval-retail      # evaluate 10 retail Q&A pairs, open HTML report
make eval-pharma      # evaluate 10 pharma Q&A pairs, open HTML report
make eval             # all 20 pairs
```

Reports are written to `eval/reports/`.


## Code quality

```bash
make lint             # ruff: zero warnings expected
make type-check       # mypy strict on client/
```

## Make targets

| Command | What it does |
| --- | --- |
| `make install` | `pip install` all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on `client/` |
| `make eval-retail` | L2 batch eval: retail domain + HTML report |
| `make eval-pharma` | L2 batch eval: pharma domain + HTML report |
| `make eval` | L2 batch eval: all domains + HTML report |

## Eval results (`make eval`)

Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases). The numbers below are from a representative run; rerun with `make eval` after knowledge base updates.

### L1 live metrics (pass rate across 20 pairs)

| Metric | Pass rate | Notes |
| --- | --- | --- |
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded the 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below the 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used the canonical key instead of the client-specific term |

### L2 keyphrase coverage (batch, retail domain)

| Client | Pairs | Avg coverage |
| --- | --- | --- |
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |

To update these numbers: `make eval` (server must be running).


## Architecture

See ARCHITECTURE.md for system design, evaluation layers, and deliberate tradeoffs.

See NOTES.md for design decisions, what's next, and LLM transparency.


## Evaluation metrics

| Metric | Layer | Method |
| --- | --- | --- |
| PII Leakage | L1 live | Regex scan, binary |
| Token Budget | L1 live | Char count ÷ 4 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
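
For intuition, a minimal sketch of four of these checks in isolation. The PII regex, embedding model, and pass thresholds are assumptions for illustration (the 512-token budget and 0.45 relevancy threshold come from the eval notes above), not the project's actual implementation; faithfulness and terminology are omitted because they depend on model and client-data specifics:

```python
# Illustrative metric implementations -- regex, model name, and thresholds
# are assumptions, not the project's actual code.
import re
from sentence_transformers import SentenceTransformer, util

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.\w+")  # one of several PII patterns

def pii_leakage(answer: str) -> bool:
    """L1: binary regex scan -- True means a PII pattern leaked."""
    return bool(EMAIL_RE.search(answer))

def token_budget_ok(answer: str, budget: int = 512) -> bool:
    """L1: rough token estimate -- character count divided by 4."""
    return len(answer) / 4 <= budget

_bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def answer_relevancy_ok(question: str, answer: str,
                        threshold: float = 0.45) -> bool:
    """L1: cosine similarity between question and answer embeddings."""
    q_emb, a_emb = _bi_encoder.encode([question, answer])
    return float(util.cos_sim(q_emb, a_emb)) >= threshold

def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """L2: fraction of expected keyphrases found verbatim in the answer."""
    hits = sum(p.lower() in answer.lower() for p in expected)
    return hits / len(expected) if expected else 1.0
```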

**Core principle:** no single metric proves correctness. The combination does.