Spaces:

below-threshold
/

ai-response-validator

Sleeping

File size: 4,644 Bytes

---
title: AI Response Validator
emoji: 🔍
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# AI Response Validator

Domain-agnostic RAG evaluation system. Validates AI responses for correctness,
faithfulness, and client-specific terminology across retail and pharma domains.

**Live demo:** select a domain and client, ask a question — each response is evaluated
in real time across 5 metrics. [→ Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)

---

## Setup (5 minutes)

**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).

```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```

---

## Running the app

```bash
make run        # starts API at http://localhost:8000
```

Open `http://localhost:8000` in a browser — the UI loads automatically.

---

## Tests

```bash
make test                 # unit tests only — no server, no API key needed
make test-integration     # integration tests — requires make run in another terminal
```

Unit tests cover graders, terminology logic, and client error handling.
Integration tests hit the live API and verify end-to-end behavior.
All tests are stateless — no cleanup required.

---

## Batch evaluation (L2)

```bash
make eval-retail      # evaluate retail Q&A pairs, open HTML report
make eval-pharma      # evaluate pharma Q&A pairs, open HTML report
make eval             # all domains
```

Reports are written to `eval/reports/`.

**Drift detection** (no server required):

```bash
python eval/simulate_traffic.py   # populate telemetry + run drift report
python eval/drift.py              # drift report against live telemetry
```

Compares live grader score distributions against the golden-dataset baseline using KS tests.
Detects faithfulness degradation from model updates, KB staleness, or query distribution shift.

---

## Code quality

```bash
make lint             # ruff — zero warnings expected
make type-check       # mypy strict on client/
```

---

## Make targets

| Command | What it does |
|---------|-------------|
| `make install` | pip install all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on client/ |
| `make eval-retail` | L2 batch eval — retail domain + HTML report |
| `make eval-pharma` | L2 batch eval — pharma domain + HTML report |
| `make eval` | L2 batch eval — all domains + HTML report |

---

## Eval results (`make eval`)

Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases).
Results from a representative run — rerun with `make eval` after knowledge base updates.

### L1 live metrics (pass rate across 20 pairs)

| Metric | Pass rate | Notes |
|--------|-----------|-------|
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used canonical key instead of client-specific term |

### L2 keyphrase coverage (batch, retail domain)

| Client | Pairs | Avg coverage |
|--------|-------|-------------|
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |

To update these numbers: `make eval` (server must be running).

---

## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,
and deliberate tradeoffs.

See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency.

---

## Evaluation metrics

| Metric | Layer | Method |
|--------|-------|--------|
| PII Leakage | L1 live | Regex scan — binary |
| Token Budget | L1 live | Char count ÷ 4 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Claim decomposition + sentence-level NLI cross-encoder |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
| Drift Detection | L2 offline | KS two-sample test vs golden-dataset baseline |

**Core principle:** no single metric proves correctness. The combination does.