---
title: AI Response Validator
emoji: πŸ”
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---
# AI Response Validator
Domain-agnostic RAG evaluation system. Validates AI responses for correctness,
faithfulness, and client-specific terminology across retail and pharma domains.
**Live demo:** select a domain and client, ask a question β€” each response is evaluated
in real time across 5 metrics. [β†’ Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)
---
## Setup (5 minutes)
**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).
```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```
---
## Running the app
```bash
make run # starts API at http://localhost:8000
```
Open `http://localhost:8000` in a browser β€” the UI loads automatically.
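If you'd rather script against the API than use the UI, the sketch below shows the general shape of a request. The route name and payload fields here are hypothetical, for illustration only; check the running server for the actual endpoints and schema.

```python
# Minimal sketch of querying the API from Python.
# NOTE: the route name and payload fields are hypothetical --
# substitute the actual endpoint and schema the server exposes.
import requests

resp = requests.post(
    "http://localhost:8000/validate",  # hypothetical route
    json={
        "domain": "retail",  # assumed field names
        "client": "NovaMart",
        "question": "What is the return window for electronics?",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expect the answer plus per-metric evaluation results
```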
---
## Tests
```bash
make test # unit tests only β€” no server, no API key needed
make test-integration # integration tests β€” requires make run in another terminal
```
Unit tests cover graders, terminology logic, and client error handling.
Integration tests hit the live API and verify end-to-end behavior.
All tests are stateless β€” no cleanup required.
---
## Batch evaluation (L2)
```bash
make eval-retail # evaluate 10 retail Q&A pairs, open HTML report
make eval-pharma # evaluate 10 pharma Q&A pairs, open HTML report
make eval # all 20 pairs
```
Reports are written to `eval/reports/`.
---
## Code quality
```bash
make lint # ruff β€” zero warnings expected
make type-check # mypy strict on client/
```
---
## Make targets
| Command | What it does |
|---------|-------------|
| `make install` | pip install all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on client/ |
| `make eval-retail` | L2 batch eval β€” retail domain + HTML report |
| `make eval-pharma` | L2 batch eval β€” pharma domain + HTML report |
| `make eval` | L2 batch eval β€” all domains + HTML report |
---
## Eval results (`make eval`)
Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases).
The numbers below come from a representative run; rerun with `make eval` after knowledge base updates.
### L1 live metrics (pass rate across 20 pairs)
| Metric | Pass rate | Notes |
|--------|-----------|-------|
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used canonical key instead of client-specific term |
### L2 keyphrase coverage (batch, retail domain)
| Client | Pairs | Avg coverage (0–1) |
|--------|-------|-------------|
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |
To update these numbers: `make eval` (server must be running).
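Coverage is conceptually a matched-over-expected ratio. A simplified sketch (the real scorer may normalize or fuzzy-match phrases rather than do exact substring checks):

```python
# Simplified keyphrase coverage -- illustrative only; the project's
# scorer may normalize, stem, or fuzzy-match instead of substring checks.
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected keyphrases found in the answer (case-insensitive)."""
    answer_lower = answer.lower()
    matched = sum(1 for phrase in expected if phrase.lower() in answer_lower)
    return matched / len(expected) if expected else 0.0

# Example: 2 of 3 phrases present -> 0.67
print(round(keyphrase_coverage(
    "Returns are accepted within 30 days with a receipt.",
    ["30 days", "receipt", "store credit"],
), 2))
```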
---
## Architecture
See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,
and deliberate tradeoffs.
See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency.
---
## Evaluation metrics
| Metric | Layer | Method |
|--------|-------|--------|
| PII Leakage | L1 live | Regex scan β€” binary |
| Token Budget | L1 live | Char count ÷ 4 vs. 512-token budget |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
**Core principle:** no single metric proves correctness. The combination does.
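To make the table concrete, here is a heavily simplified sketch of three of the L1 graders. The regex set and embedding model are assumptions for illustration; the 512-token budget and 0.45 relevancy threshold come from the eval results above. See the grader code for the real implementations.

```python
# Heavily simplified L1 grader sketches -- illustrative only. The regex
# patterns and embedding model below are assumptions; the 512-token budget
# and 0.45 threshold match the eval results above.
import re
from sentence_transformers import SentenceTransformer, util

# PII leakage: binary regex scan (example patterns, not the full set)
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def pii_leakage(answer: str) -> bool:
    return any(p.search(answer) for p in PII_PATTERNS)

# Token budget: estimate tokens as character count / 4
def token_budget_ok(answer: str, budget: int = 512) -> bool:
    return len(answer) / 4 <= budget

# Answer relevancy: bi-encoder cosine similarity between question and answer
# (model choice here is an assumption)
_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def answer_relevant(question: str, answer: str, threshold: float = 0.45) -> bool:
    q_emb, a_emb = _model.encode([question, answer], convert_to_tensor=True)
    return util.cos_sim(q_emb, a_emb).item() >= threshold
```

Faithfulness is different in kind: Vectara's HHEM cross-encoder scores (premise, hypothesis) pairs jointly rather than comparing independent embeddings, which is why it sits in its own grader.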