---
title: AI Response Validator
emoji: π
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
---

# AI Response Validator

Domain-agnostic RAG evaluation system. Validates AI responses for correctness,
faithfulness, and client-specific terminology across retail and pharma domains.

**Live demo:** select a domain and client, ask a question; each response is evaluated
in real time across 5 metrics. [→ Open on HuggingFace Spaces](https://huggingface.co/spaces/below-threshold/ai-response-validator)

---
## Setup (5 minutes)

**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).

```bash
git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
cd ai-response-validator
make install
export HF_TOKEN=hf_...
```
---

## Running the app

```bash
make run  # starts API at http://localhost:8000
```

Open `http://localhost:8000` in a browser; the UI loads automatically.
---

## Tests

```bash
make test              # unit tests only - no server, no API key needed
make test-integration  # integration tests - requires make run in another terminal
```

Unit tests cover graders, terminology logic, and client error handling.
Integration tests hit the live API and verify end-to-end behavior.
All tests are stateless; no cleanup required.
---

## Batch evaluation (L2)

```bash
make eval-retail  # evaluate 10 retail Q&A pairs, open HTML report
make eval-pharma  # evaluate 10 pharma Q&A pairs, open HTML report
make eval         # all 20 pairs
```

Reports are written to `eval/reports/`.
---

## Code quality

```bash
make lint        # ruff - zero warnings expected
make type-check  # mypy strict on client/
```
---

## Make targets

| Command | What it does |
|---------|--------------|
| `make install` | pip install all dependencies |
| `make run` | Start API server at localhost:8000 |
| `make test` | Unit tests (no network required) |
| `make test-integration` | Integration tests (server must be running) |
| `make lint` | Ruff linting across backend, client, tests |
| `make type-check` | mypy strict mode on client/ |
| `make eval-retail` | L2 batch eval: retail domain + HTML report |
| `make eval-pharma` | L2 batch eval: pharma domain + HTML report |
| `make eval` | L2 batch eval: all domains + HTML report |
---

## Eval results (`make eval`)

Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases).
Results are from a representative run; rerun with `make eval` after knowledge base updates.

### L1 live metrics (pass rate across 20 pairs)

| Metric | Pass rate | Notes |
|--------|-----------|-------|
| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded the 512-token budget |
| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below the 0.45 threshold |
| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
| `chain_terminology` | 18/20 (90%) | 2 responses used the canonical key instead of the client-specific term |
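The `answer_relevancy` numbers above come from a bi-encoder cosine-similarity check against the 0.45 threshold. A minimal sketch of that gate, with toy embedding vectors standing in for real bi-encoder output (the function names here are illustrative, not the project's):

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors in place of bi-encoder embeddings of question and answer.
question_emb = [0.2, 0.7, 0.1]
answer_emb = [0.25, 0.6, 0.2]

RELEVANCY_THRESHOLD = 0.45  # threshold from the table above
print(cosine(question_emb, answer_emb) >= RELEVANCY_THRESHOLD)
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so the threshold effectively asks "does the answer point in roughly the same semantic direction as the question".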
### L2 keyphrase coverage (batch, retail domain)

| Client | Pairs | Avg coverage |
|--------|-------|--------------|
| NovaMart | 5 | 0.74 |
| ShelfWise | 5 | 0.71 |

To update these numbers, run `make eval` (server must be running).
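Per pair, keyphrase coverage is the fraction of expected keyphrases found in the answer. A minimal sketch under the assumption of case-insensitive substring matching (the project's actual matching rules may be stricter):

```python
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected keyphrases that appear in the answer."""
    if not expected:
        return 1.0
    text = answer.lower()
    hits = sum(1 for phrase in expected if phrase.lower() in text)
    return hits / len(expected)


# Toy example - not a pair from the golden set.
answer = "Returns are accepted within 30 days with a valid receipt."
print(keyphrase_coverage(answer, ["30 days", "receipt", "refund"]))  # 2 of 3 matched
```

The per-client numbers in the table are averages of this per-pair score across the 5 pairs for that client.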
---

## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,
and deliberate tradeoffs.
See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency.
---

## Evaluation metrics

| Metric | Layer | Method |
|--------|-------|--------|
| PII Leakage | L1 live | Regex scan → binary |
| Token Budget | L1 live | Char count ÷ 4 |
| Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
| Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
| Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |

**Core principle:** no single metric proves correctness. The combination does.
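The token-budget gate is the simplest of the five: estimate tokens as characters divided by 4, then compare against the 512-token budget mentioned in the eval results. A sketch assuming integer division (the project's exact rounding may differ):

```python
def estimate_tokens(text: str) -> int:
    # Heuristic from the table above: roughly one token per 4 characters.
    return len(text) // 4


def within_budget(text: str, budget: int = 512) -> bool:
    """True if the estimated token count fits the budget."""
    return estimate_tokens(text) <= budget


print(estimate_tokens("A short answer."))  # 15 chars -> 3 estimated tokens
print(within_budget("x" * 3000))           # 750 estimated tokens -> False
```

The heuristic avoids loading a tokenizer on the hot path; it overestimates for whitespace-heavy text and underestimates for dense code, which is acceptable for a coarse budget gate.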