---
title: AI Response Validator
emoji: 🔍
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# AI Response Validator

Domain-agnostic RAG evaluation system. Validates AI responses for correctness, faithfulness, and client-specific terminology across retail and pharma domains.

**Live demo:** select a domain and client, then ask a question; each response is evaluated in real time across 5 metrics.

[→ Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)

---

## Setup (5 minutes)

**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).

```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```

---

## Running the app

```bash
make run   # starts API at http://localhost:8000
```

Open `http://localhost:8000` in a browser; the UI loads automatically.

---

## Tests

```bash
make test              # unit tests only; no server, no API key needed
make test-integration  # integration tests; requires make run in another terminal
```

Unit tests cover graders, terminology logic, and client error handling. Integration tests hit the live API and verify end-to-end behavior. All tests are stateless; no cleanup is required.

---

## Batch evaluation (L2)

```bash
make eval-retail   # evaluate 10 retail Q&A pairs, open HTML report
make eval-pharma   # evaluate 10 pharma Q&A pairs, open HTML report
make eval          # all 20 pairs
```

Reports are written to `eval/reports/`.

---

## Code quality

```bash
make lint        # ruff; zero warnings expected
make type-check  # mypy strict on client/
```

---

## Make targets

| Command | What it does |
|---------|--------------|
| `make install` | pip install all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on client/ |
| `make eval-retail` | L2 batch eval: retail domain + HTML report |
| `make eval-pharma` | L2 batch eval: pharma domain + HTML report |
| `make eval` | L2 batch eval: all domains + HTML report |

---

## Eval results (`make eval`)

Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases). Results are from a representative run; rerun with `make eval` after knowledge base updates.

### L1 live metrics (pass rate across 20 pairs)

| Metric | Pass rate | Notes |
|--------|-----------|-------|
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded the 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below the 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used the canonical key instead of the client-specific term |

### L2 keyphrase coverage (batch, retail domain)

| Client | Pairs | Avg coverage |
|--------|-------|--------------|
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |

Coverage is the fraction of expected keyphrases found in each answer (see the sketch below). To update these numbers: `make eval` (server must be running).

---

## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers, and deliberate tradeoffs.
See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency.
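---

## Keyphrase coverage, sketched

The L2 coverage numbers above are the fraction of expected keyphrases found in each answer. A minimal sketch of that idea, not the repository's actual grader; the matching and normalization details here are assumptions:

```python
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected keyphrases found in the answer.

    Case-insensitive substring match; the real grader may normalize
    differently (stemming, punctuation handling, etc.).
    """
    if not expected:
        return 1.0  # nothing expected, nothing to miss
    haystack = answer.lower()
    hits = sum(1 for phrase in expected if phrase.lower() in haystack)
    return hits / len(expected)


# Three of four expected phrases present -> 0.75, comparable to the
# 0.71-0.74 averages in the table above.
print(keyphrase_coverage(
    "Returns are accepted within 30 days with a receipt at any NovaMart store.",
    ["30 days", "receipt", "NovaMart", "store credit"],
))
```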
---

## Evaluation metrics

| Metric | Layer | Method |
|--------|-------|--------|
| PII Leakage | L1 live | Regex scan (binary) |
| Token Budget | L1 live | Char count ÷ 4 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |

**Core principle:** no single metric proves correctness. The combination does.
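For example, the Token Budget row is a pure heuristic (character count ÷ 4, checked against the 512-token budget), and an overall verdict requires every live metric to pass at once. A minimal sketch of how the pieces compose; the function names and signatures are hypothetical, while the thresholds are the ones reported above:

```python
TOKEN_BUDGET = 512          # budget used in the eval runs above
RELEVANCY_THRESHOLD = 0.45  # answer_relevancy cutoff reported above


def estimate_tokens(text: str) -> int:
    """Cheap token estimate: roughly 4 characters per token."""
    return len(text) // 4


def overall_pass(answer: str, *, pii_free: bool, relevancy: float,
                 faithful: bool, terminology_ok: bool) -> bool:
    """No single metric proves correctness; all of them must pass."""
    return (
        pii_free
        and estimate_tokens(answer) <= TOKEN_BUDGET
        and relevancy >= RELEVANCY_THRESHOLD
        and faithful
        and terminology_ok
    )
```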