Architecture – AI Response Validator

What this is

A domain-agnostic RAG evaluation system that validates AI responses for correctness, faithfulness, and client-specific terminology. Built as a portfolio demonstration of eval-driven architecture applied to production AI systems.

Core claim: no single metric proves correctness. The combination does.


System overview

USER QUERY + CLIENT SELECTION
           │
           ▼
    ┌─────────────┐
    │   FastAPI   │  /query endpoint
    └──────┬──────┘
           │
           ▼
    ┌─────────────────────────────────────┐
    │           pipeline.run()            │
    │                                     │
    │  1. retrieve()                      │
    │     cosine search over KB index     │
    │     sentence-transformers embed     │
    │     top-3 docs returned             │
    │                                     │
    │  2. _generate()                     │
    │     context injected into prompt    │
    │     Llama 3 (HF Inference)          │
    │     generates answer                │
    │                                     │
    │  3. grade()                         │
    │     5 L1 metrics run in sequence    │
    └─────────────────────────────────────┘
           │
           ▼
    ┌─────────────────────────────────────┐
    │        GradeReport.summary()        │
    │   {overall_pass, metrics: {…}}      │
    └─────────────────────────────────────┘
           │
           ▼
      JSON response → UI eval panel
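
For orientation, the /query endpoint can be exercised directly over HTTP. A minimal sketch using httpx (the payload and response field names are assumptions inferred from the diagram, not the exact schema; the host and port are whatever the FastAPI app is served on):

```python
import httpx

# Hypothetical request body: the query plus the domain/client selection made in the UI.
payload = {
    "query": "Can you run an availability scan for store 12?",
    "domain": "retail",
    "client": "novamart",
}

resp = httpx.post("http://localhost:8000/query", json=payload, timeout=60.0)
resp.raise_for_status()
data = resp.json()

# Assumed response shape: the generated answer plus the L1 grade summary
# ({overall_pass, metrics: {…}}) that the UI eval panel renders.
print(data["answer"])
print(data["grade"]["overall_pass"], list(data["grade"]["metrics"]))
```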

Two-layer evaluation

L1 – Live (every query, ~1–2 s overhead)

Runs inline with every request. No ground truth required.

| Metric | Method | Threshold | Rationale |
|---|---|---|---|
| pii_leakage | Regex (SSN, email, phone, card) | binary | Safety gate – fails hard |
| token_budget | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
| answer_relevancy | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
| faithfulness | Vectara HHEM v2 cross-encoder | ≥ 0.35 | Hallucination detection |
| chain_terminology | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
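
Two of these checks are fully deterministic and cheap enough to sketch in full. The patterns and the ÷4 heuristic below follow the table; the exact regexes and function names in grader.py may differ:

```python
import re

# Rough PII patterns -- illustrative, not the exact set used by grader.py.
PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def pii_leakage_passes(response: str) -> bool:
    """Binary safety gate: any match fails hard."""
    return not any(p.search(response) for p in PII_PATTERNS.values())

def token_budget_passes(response: str, limit: int = 512) -> bool:
    """Approximate tokens as character count / 4 and enforce the 512-token budget."""
    return len(response) / 4 <= limit
```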

L2 – Batch (local, against golden dataset)

python eval/metrics.py --domain retail
python eval/metrics.py --client novamart --out results.json

Runs all 20 golden pairs through the full pipeline. Adds keyphrase coverage scoring on top of L1 metrics to verify factual completeness against reference answers.
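
The keyphrase coverage idea is simple: what fraction of the reference answer's expected phrases actually appear in the generated answer. A minimal sketch (any normalization, weighting, or field naming in eval/metrics.py is not reproduced here):

```python
def keyphrase_coverage(response: str, keyphrases: list[str]) -> float:
    """Fraction of expected keyphrases present in the response (case-insensitive)."""
    if not keyphrases:
        return 1.0
    text = response.lower()
    hits = sum(1 for phrase in keyphrases if phrase.lower() in text)
    return hits / len(keyphrases)

# e.g. keyphrase_coverage(answer, ["availability scan", "store 142"]) == 0.5
# when exactly one of the two phrases appears in the answer.
```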


Key design decisions

Bi-encoder vs cross-encoder: where each is used

Two fundamentally different model architectures serve different roles in this system:

| | Bi-encoder | Cross-encoder |
|---|---|---|
| How it works | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
| Speed | Fast – embeddings pre-computed at index build time | Slow – must re-encode every (query, doc) pair at inference |
| Quality | Good for retrieval: finds semantically similar docs | Better for re-ranking or NLI: captures fine-grained entailment |
| Used here for | KB retrieval (all-MiniLM-L6-v2) and answer relevancy | Faithfulness scoring (Vectara HHEM v2) |

Measured overhead (CPU, HF Spaces):

| Step | Model | Typical latency |
|---|---|---|
| Query embedding | bi-encoder (all-MiniLM-L6-v2) | ~10–15 ms |
| KB cosine search (1,346 docs) | numpy matrix multiply | ~2 ms |
| Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
| Faithfulness (3 chunk pairs) | cross-encoder (Vectara HHEM v2) | ~300–600 ms |
| Total grading overhead | – | ~350–650 ms |

Why bi-encoder for retrieval: query time is constant regardless of KB size because document embeddings are pre-built when the KB index is created. Adding 1,000 more documents doesn't change query latency – only index build time grows.

Why cross-encoder for faithfulness: cross-encoders see both the document and the response simultaneously, capturing entailment relationships bi-encoders miss. A response can be semantically similar to a document (high cosine) while still hallucinating specific facts; the cross-encoder catches this, the bi-encoder does not.
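
The difference shows up directly in the call shapes. A minimal sketch using the public sentence-transformers and transformers APIs (the example sentences are invented, the HHEM call follows the model's Hugging Face card, and this is not necessarily how grader.py wires the models):

```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSequenceClassification

doc = "The availability scan covers all NovaMart stores and refreshes hourly."
answer = "The availability scan refreshes every five minutes across all stores."

# Bi-encoder: two independent embeddings, then cosine similarity.
bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cosine = util.cos_sim(bi.encode(doc), bi.encode(answer)).item()

# Cross-encoder: the (document, response) pair is read jointly.
hhem = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)
faithfulness = hhem.predict([(doc, answer)]).item()

# A response like this can score high on cosine (same topic, same vocabulary)
# yet low on faithfulness, because "every five minutes" is not supported
# by the document -- exactly the gap described above.
print(f"cosine={cosine:.2f}  faithfulness={faithfulness:.2f}")
```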

RosettaStone pattern

Each domain has a canonical term vocabulary (STOCK_CHECK, DRUG_APPROVAL, etc.). Each client maps these to their own terminology. The bot must speak the client's language, not the canonical internal names.

STOCK_CHECK → "availability scan"   (NovaMart)
STOCK_CHECK → "stock check"         (ShelfWise)

check_terminology() is deterministic – zero latency, no LLM, no false negatives. It flags rival-client terms appearing without the correct client term.
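
A minimal sketch of the pattern (the mapping shape and function signature are assumptions; rosetta.py and term-catalog.yaml may organize this differently):

```python
# Hypothetical shape of the canonical -> client mappings loaded from term-catalog.yaml.
TERM_CATALOG = {
    "STOCK_CHECK": {"novamart": "availability scan", "shelfwise": "stock check"},
}

def check_terminology(response: str, client: str) -> list[str]:
    """Deterministic check: flag rival-client terms used without the correct client term."""
    text = response.lower()
    violations = []
    for canonical, by_client in TERM_CATALOG.items():
        own_term = by_client[client]
        for rival_client, rival_term in by_client.items():
            if rival_client == client:
                continue
            if rival_term in text and own_term not in text:
                violations.append(f"{canonical}: found '{rival_term}', expected '{own_term}'")
    return violations

# check_terminology("Run a stock check at store 12.", "novamart")
# -> ["STOCK_CHECK: found 'stock check', expected 'availability scan'"]
```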

Why this matters: in production multi-tenant AI systems, terminology leakage between clients is a real failure mode. This catches it mechanically.

Faithfulness via Vectara HHEM v2

The faithfulness grader uses Vectara's Hallucination Evaluation Model – a cross-encoder fine-tuned specifically for RAG faithfulness (not general NLI entailment). It scores (document_chunk, response) pairs and returns a probability in [0, 1] that the response is factually consistent with the document.

Why not Claude-as-judge: adds API cost and latency per query; non-deterministic; requires prompt engineering to produce consistent scores. A purpose-built cross-encoder is faster, cheaper, and more consistent for this specific task.

Why not generic NLI (DeBERTa): general NLI models are trained on textual entailment benchmarks, not RAG faithfulness. They score whether a hypothesis follows logically from a premise – a different task. Correct, grounded answers score near zero on NLI entailment, causing false positives. HHEM v2 is trained on (document, response) pairs from real RAG systems, which maps directly to this use case.

In-memory semantic retrieval

KB documents are encoded once per domain at first query and cached in a module-level dict. Cosine search at query time. No vector database, no persistent storage.
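
A minimal sketch of that pattern, assuming numpy and sentence-transformers (the module-level names are hypothetical, not copied from the backend):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
_kb_cache: dict[str, tuple[list[str], np.ndarray]] = {}  # domain -> (docs, normalized embeddings)

def _kb_index(domain: str, docs: list[str]) -> tuple[list[str], np.ndarray]:
    """Encode the domain's KB once on first use and keep it in a module-level dict."""
    if domain not in _kb_cache:
        embeddings = _model.encode(docs, normalize_embeddings=True)
        _kb_cache[domain] = (docs, embeddings)
    return _kb_cache[domain]

def retrieve(query: str, domain: str, docs: list[str], k: int = 3) -> list[str]:
    """Cosine search: with normalized embeddings this reduces to one matrix multiply."""
    kb_docs, kb_emb = _kb_index(domain, docs)
    q = _model.encode(query, normalize_embeddings=True)
    scores = kb_emb @ q
    top = np.argsort(scores)[::-1][:k]
    return [kb_docs[i] for i in top]
```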

Why no vector DB: the KB is small (8-9 docs per domain). A vector DB would add operational complexity with zero retrieval quality benefit at this scale.

Tradeoff accepted: KB updates require a process restart to invalidate the cache. Acceptable for a demo; production would add a cache invalidation signal.

Why two evaluation layers

L1 catches structural failures (PII, irrelevance, hallucination) instantly, on every query. L2 catches factual gaps by comparing against reference answers; it requires ground truth, so it can't run live.

Running L2 on every query would add 30+ seconds of latency (LLM reverse-question generation, per-chunk precision scoring). The two-layer split is a deliberate latency/depth tradeoff.


Repository structure

backend/
  app.py          FastAPI app – endpoints, lifespan, static file serving
  pipeline.py     Orchestrator – retrieve → generate → grade
  grader.py       L1 metric implementations + GradeReport dataclass
  rosetta.py      RosettaStone – canonical ↔ client term translation
  config.py       Domain/client registry, shared constants

client/
  client.py       ValidatorClient – typed HTTP client with retries and timeouts
  models.py       Pydantic request/response models
  exceptions.py   APIError, TimeoutError, RetryExhaustedError

tests/
  unit/           Behavioral tests – no network, no LLM (make test)
  integration/    End-to-end tests against live API (make test-integration)
  conftest.py     Shared fixtures and integration marker

knowledge/
  retail/
    term-catalog.yaml   Canonical → client term mappings (NovaMart, ShelfWise)
    features.yaml       KB documents for retrieval
  pharma/
    term-catalog.yaml   Canonical → client term mappings (ClinixOne, PharmaLink)
    features.yaml       KB documents for retrieval

eval/
  golden-dataset.yaml   20 Q&A pairs (10 retail, 10 pharma) for L2 evaluation
  metrics.py            L2 batch runner – CLI, keyphrase scoring, HTML report

ui/
  index.html    Chat interface + eval panel
  app.js        Domain/client switcher, message rendering, metric cards

Deliberate tradeoffs

| Decision | Alternative | Why this |
|---|---|---|
| Vectara HHEM v2 for faithfulness | Claude-as-judge / DeBERTa NLI | Purpose-built for RAG faithfulness; no API cost; deterministic |
| In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
| Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
| Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
| Plain HTML/JS frontend | React/Next.js | No build step – deploys as static files |
| httpx + tenacity client | requests + urllib3 retry | Cleaner timeout API, native async path if needed |
| Pydantic v2 models | TypedDict / dataclasses | Validation at boundary, IDE autocomplete, mypy strict |

Evaluation coverage vs RAGAS

| RAGAS metric | Coverage |
|---|---|
| faithfulness | ✓ L1 (Vectara HHEM v2) |
| answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) |
| context_precision | partial – retrieval score visible in UI |
| context_recall | ✓ L2 (keyphrase coverage) |
| answer_correctness | ✓ L2 (keyphrase + expected_answer) |