# Architecture – AI Response Validator
## What this is

A domain-agnostic RAG evaluation system that validates AI responses for correctness,
faithfulness, and client-specific terminology. Built as a portfolio demonstration of
eval-driven architecture applied to production AI systems.

**Core claim:** no single metric proves correctness. The combination does.

---

## System overview
```
         USER QUERY + CLIENT SELECTION
                       │
                       ▼
               ┌───────────────┐
               │    FastAPI    │  /query endpoint
               └───────┬───────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│ pipeline.run()                              │
│                                             │
│ 1. retrieve()                               │
│    cosine search over KB index              │
│    sentence-transformers embed              │
│    top-3 docs returned                      │
│                                             │
│ 2. _generate()                              │
│    context injected into prompt             │
│    Llama 3 (HF Inference) generates answer  │
│                                             │
│ 3. grade()                                  │
│    5 L1 metrics run in sequence             │
└─────────────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│ GradeReport.summary()                       │
│ {overall_pass, metrics: {…}}                │
└─────────────────────────────────────────────┘
                       │
                       ▼
         JSON response → UI eval panel
```
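
The orchestration itself is a thin wrapper around three calls. A runnable sketch of that shape, with `retrieve()` / `_generate()` / `grade()` stubbed out (the real implementations live in `backend/pipeline.py` and `backend/grader.py`; the stub bodies and the result dict are illustrative):

```python
# Shape of pipeline.run() as drawn above. The stubs only show the data flow;
# the real retrieve/_generate/grade are implemented in pipeline.py / grader.py.
from typing import Any

def retrieve(query: str, domain: str, top_k: int = 3) -> list[dict]:
    return [{"id": "doc-1", "text": "…", "score": 0.82}][:top_k]        # stub

def _generate(query: str, docs: list[dict], client: str) -> str:
    return "stubbed answer grounded in the retrieved docs"              # stub

def grade(answer: str, query: str, docs: list[dict], client: str) -> dict[str, Any]:
    return {"overall_pass": True, "metrics": {}}                        # stub

def run(query: str, domain: str, client: str) -> dict[str, Any]:
    docs = retrieve(query, domain, top_k=3)       # 1. bi-encoder cosine search
    answer = _generate(query, docs, client)       # 2. Llama 3 via HF Inference
    report = grade(answer, query, docs, client)   # 3. five L1 metrics
    return {"answer": answer, "retrieved": docs, **report}

print(run("Is SKU 123 in stock?", domain="retail", client="novamart"))
```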

---

## Two-layer evaluation

### L1 – Live (every query, ~1-2 s overhead)

Runs inline with every request. No ground truth required.

| Metric | Method | Threshold | Rationale |
|--------|--------|-----------|-----------|
| `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate; fails hard |
| `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
| `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
| `faithfulness` | Vectara HHEM v2 cross-encoder | ≥ 0.35 | Hallucination detection |
| `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
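
`grade()` collects one result per metric and rolls them up; `GradeReport.summary()` is what the `/query` endpoint returns to the UI. A minimal self-contained sketch of that shape (only `overall_pass` and `metrics` are named in this doc; the other field names are illustrative):

```python
# grader.py (sketch): per-metric results rolled up into a single pass/fail report.
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    score: float
    threshold: float
    passed: bool

@dataclass
class GradeReport:
    results: list[MetricResult]

    def summary(self) -> dict:
        return {
            "overall_pass": all(r.passed for r in self.results),
            "metrics": {
                r.name: {"score": r.score, "threshold": r.threshold, "passed": r.passed}
                for r in self.results
            },
        }

# One concrete check from the table: token budget as char count / 4 <= 512.
def check_token_budget(answer: str, limit: int = 512) -> MetricResult:
    est_tokens = len(answer) / 4
    return MetricResult("token_budget", est_tokens, limit, est_tokens <= limit)
```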

### L2 – Batch (local, against golden dataset)

```bash
python eval/metrics.py --domain retail
python eval/metrics.py --client novamart --out results.json
```

Runs all 20 golden pairs through the full pipeline. Adds keyphrase coverage scoring
on top of L1 metrics to verify factual completeness against reference answers.
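
The exact scoring rule isn't spelled out here; one plausible minimal form of keyphrase coverage (illustrative, not necessarily what `eval/metrics.py` implements) is the fraction of a golden pair's expected keyphrases found in the generated answer:

```python
# Keyphrase coverage (illustrative): share of expected keyphrases present in the answer.
def keyphrase_coverage(answer: str, keyphrases: list[str]) -> float:
    text = answer.lower()
    hits = sum(1 for kp in keyphrases if kp.lower() in text)
    return hits / len(keyphrases) if keyphrases else 1.0

# keyphrase_coverage("Tier 2 drugs need prior authorization.",
#                    ["prior authorization", "formulary tier"])  -> 0.5
```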

---

## Key design decisions

### Bi-encoder vs cross-encoder: where each is used

Two fundamentally different model architectures serve different roles in this system:

| | Bi-encoder | Cross-encoder |
|---|---|---|
| **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
| **Speed** | Fast: embeddings pre-computed at index build time | Slow: must re-encode every (query, doc) pair at inference |
| **Quality** | Good for retrieval: finds semantically similar docs | Better for re-ranking or NLI: captures fine-grained entailment |
| **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (Vectara HHEM v2) |

**Measured overhead (CPU, HF Spaces):**

| Step | Model | Typical latency |
|------|-------|-----------------|
| Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10-15 ms |
| KB cosine search (1,346 docs) | numpy matrix multiply | ~2 ms |
| Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
| Faithfulness (3 chunk pairs) | cross-encoder (Vectara HHEM v2) | ~300-600 ms |
| Total grading overhead | | ~350-650 ms |

**Why bi-encoder for retrieval:** query time is constant regardless of KB size because
document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
query latency; only index build time grows.

**Why cross-encoder for faithfulness:** cross-encoders see both the document and the
response simultaneously, capturing entailment relationships bi-encoders miss. A response
can be semantically similar to a document (high cosine) while still hallucinating specific
facts; the cross-encoder catches this, the bi-encoder does not.
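
The two call patterns look like this in `sentence-transformers` (the cross-encoder model named here is a generic re-ranker used only as a stand-in to show the pairwise interface; the actual faithfulness call to HHEM v2 is shown in the next section):

```python
# Bi-encoder vs cross-encoder call patterns (sketch).
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "Can I return an opened item?"
docs = [
    "Returns are accepted within 30 days with a receipt.",
    "Loyalty points expire after 12 months.",
]

# Bi-encoder: docs embedded once at index build, query embedded per request,
# relevance is a single matrix multiply over normalized vectors.
bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vecs = bi.encode(docs, normalize_embeddings=True)   # pre-built at startup
q_vec = bi.encode([query], normalize_embeddings=True)   # ~10-15 ms on CPU
cosine_scores = (q_vec @ doc_vecs.T)[0]                 # ~2 ms even for 1k+ docs
print(np.argsort(-cosine_scores)[:3])                   # top-3 doc indices

# Cross-encoder: every (query, doc) pair is re-encoded jointly -- slower,
# but the model attends across both texts at once.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stand-in re-ranker
print(ce.predict([(query, d) for d in docs]))
```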

### RosettaStone pattern

Each domain has a canonical term vocabulary (`STOCK_CHECK`, `DRUG_APPROVAL`, etc.).
Each client maps these to their own terminology. The bot must speak the client's
language, not the canonical internal names.

```
STOCK_CHECK → "availability scan"  (NovaMart)
STOCK_CHECK → "stock check"        (ShelfWise)
```

`check_terminology()` is deterministic: zero latency, no LLM, no false negatives.
It flags rival-client terms appearing without the correct client term.
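
A self-contained sketch of the mechanics (the real mappings are loaded from `knowledge/<domain>/term-catalog.yaml`; the inline catalog below is illustrative):

```python
# Deterministic terminology check (sketch): flag canonical concepts expressed
# with a rival client's phrase when the selected client's phrase is absent.
TERM_CATALOG = {  # canonical term -> {client: client-facing phrase}
    "STOCK_CHECK": {"novamart": "availability scan", "shelfwise": "stock check"},
}

def check_terminology(response: str, client: str) -> list[str]:
    text = response.lower()
    violations = []
    for canonical, by_client in TERM_CATALOG.items():
        expected = by_client[client].lower()
        rival_terms = [t.lower() for c, t in by_client.items() if c != client]
        if any(term in text for term in rival_terms) and expected not in text:
            violations.append(canonical)
    return violations  # empty list == 0 violations == pass

# check_terminology("Running a stock check now.", client="novamart") -> ["STOCK_CHECK"]
```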

**Why this matters:** in production multi-tenant AI systems, terminology leakage
between clients is a real failure mode. This catches it mechanically.

### Faithfulness via Vectara HHEM v2

The faithfulness grader uses [Vectara's Hallucination Evaluation Model](https://huggingface.co/vectara/hallucination_evaluation_model),
a cross-encoder fine-tuned specifically for RAG faithfulness (not general NLI entailment).
It scores `(document_chunk, response)` pairs and returns a probability in [0, 1] that
the response is factually consistent with the document.
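
Per the model card, HHEM v2 loads through `transformers` with `trust_remote_code=True` and exposes a `predict()` over pairs. A minimal sketch (how `grader.py` wraps it and aggregates the per-chunk scores may differ):

```python
# Faithfulness scoring with Vectara HHEM v2 (sketch, following the model card).
from transformers import AutoModelForSequenceClassification

hhem = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

chunks = ["Returns are accepted within 30 days with a receipt."]
response = "You can return items within 90 days, no receipt needed."

scores = hhem.predict([(chunk, response) for chunk in chunks])  # one score per pair
faithfulness = float(scores.max())  # aggregation across chunks is illustrative
print(faithfulness)                 # below the 0.35 threshold -> flagged as hallucination
```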

**Why not Claude-as-judge:** adds API cost and latency per query; non-deterministic;
requires prompt engineering to produce consistent scores. A purpose-built cross-encoder
is faster, cheaper, and more consistent for this specific task.

**Why not generic NLI (DeBERTa):** general NLI models are trained on textual entailment
benchmarks, not RAG faithfulness. They score whether a hypothesis follows logically from
a premise, which is a different task. Correct, grounded answers score near zero on NLI
entailment, causing false positives. HHEM v2 is trained on (document, response) pairs from
real RAG systems, which maps directly to this use case.

### In-memory semantic retrieval

KB documents are encoded once per domain at first query and cached in a module-level
dict. Cosine search at query time. No vector database, no persistent storage.
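
A sketch of the caching pattern (the doc loader is stubbed out; the real one reads `knowledge/<domain>/features.yaml`):

```python
# In-memory KB index (sketch): embeddings built lazily per domain, cached in a
# module-level dict, searched with one matrix multiply over normalized vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
_kb_cache: dict[str, tuple[list[str], np.ndarray]] = {}  # domain -> (docs, embeddings)

def _load_docs(domain: str) -> list[str]:
    return ["Doc one …", "Doc two …"]   # stand-in for parsing features.yaml

def retrieve(query: str, domain: str, top_k: int = 3) -> list[str]:
    if domain not in _kb_cache:         # first query per domain pays the build cost
        docs = _load_docs(domain)
        _kb_cache[domain] = (docs, _model.encode(docs, normalize_embeddings=True))
    docs, doc_vecs = _kb_cache[domain]
    q_vec = _model.encode([query], normalize_embeddings=True)
    scores = (q_vec @ doc_vecs.T)[0]
    return [docs[i] for i in np.argsort(-scores)[:top_k]]
```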

**Why no vector DB:** the KB is small (8-9 docs per domain). A vector DB would add
operational complexity with zero retrieval quality benefit at this scale.

**Tradeoff accepted:** KB updates require a process restart to invalidate the cache.
Acceptable for a demo; production would add a cache invalidation signal.

### Why two evaluation layers

L1 catches structural failures (PII, irrelevance, hallucination) instantly, on every
query. L2 catches factual gaps by comparing against reference answers; it requires
ground truth, so it can't run live.

Running L2 on every query would add 30+ seconds of latency (LLM reverse-question
generation, per-chunk precision scoring). The two-layer split is a deliberate
latency/depth tradeoff.


---

## Repository structure

```
backend/
  app.py                FastAPI app: endpoints, lifespan, static file serving
  pipeline.py           Orchestrator: retrieve → generate → grade
  grader.py             L1 metric implementations + GradeReport dataclass
  rosetta.py            RosettaStone: canonical → client term translation
  config.py             Domain/client registry, shared constants
client/
  client.py             ValidatorClient: typed HTTP client with retries and timeouts
  models.py             Pydantic request/response models
  exceptions.py         APIError, TimeoutError, RetryExhaustedError
tests/
  unit/                 Behavioral tests: no network, no LLM (make test)
  integration/          End-to-end tests against live API (make test-integration)
  conftest.py           Shared fixtures and integration marker
knowledge/
  retail/
    term-catalog.yaml   Canonical → client term mappings (NovaMart, ShelfWise)
    features.yaml       KB documents for retrieval
  pharma/
    term-catalog.yaml   Canonical → client term mappings (ClinixOne, PharmaLink)
    features.yaml       KB documents for retrieval
eval/
  golden-dataset.yaml   20 Q&A pairs (10 retail, 10 pharma) for L2 evaluation
  metrics.py            L2 batch runner: CLI, keyphrase scoring, HTML report
ui/
  index.html            Chat interface + eval panel
  app.js                Domain/client switcher, message rendering, metric cards
```

---

## Deliberate tradeoffs

| Decision | Alternative | Why this |
|----------|-------------|----------|
| Vectara HHEM v2 for faithfulness | Claude-as-judge / DeBERTa NLI | Purpose-built for RAG faithfulness; no API cost; deterministic |
| In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
| Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
| Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
| Plain HTML/JS frontend | React/Next.js | No build step; deploys as static files |
| httpx + tenacity client | requests + urllib3 retry | Cleaner timeout API, native async path if needed |
| Pydantic v2 models | TypedDict / dataclasses | Validation at boundary, IDE autocomplete, mypy strict |
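
For the `httpx + tenacity` row, the client's shape looks roughly like this (exception selection and backoff values are illustrative, not necessarily what `client/client.py` uses):

```python
# Typed HTTP client with explicit timeouts and bounded retries (sketch).
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class ValidatorClient:
    def __init__(self, base_url: str, timeout: float = 30.0) -> None:
        self._http = httpx.Client(base_url=base_url, timeout=timeout)

    @retry(
        retry=retry_if_exception_type((httpx.TransportError, httpx.HTTPStatusError)),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=0.5, max=8),
        reraise=True,
    )
    def query(self, text: str, domain: str, client: str) -> dict:
        resp = self._http.post(
            "/query", json={"query": text, "domain": domain, "client": client}
        )
        resp.raise_for_status()   # non-2xx raises HTTPStatusError (retried here for simplicity)
        return resp.json()
```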

---

## Evaluation coverage vs RAGAS

| RAGAS metric | Coverage |
|---|---|
| faithfulness | ✅ L1 (HHEM v2 cross-encoder) |
| answer_relevancy | ✅ L1 (cosine) + L2 (keyphrase) |
| context_precision | partial: retrieval score visible in UI |
| context_recall | ✅ L2 (keyphrase coverage) |
| answer_correctness | ✅ L2 (keyphrase + expected_answer) |