Spaces:

below-threshold
/

ai-response-validator

Sleeping

App Files Files Community

mbochniak01 commited on 27 days ago

Commit

f748b3d

1 Parent(s): ebe934f

Add L2 batch evaluator and architecture documentation

Browse files

Files changed (2) hide show

ARCHITECTURE.md +178 -0
eval/metrics.py +134 -0

ARCHITECTURE.md CHANGED Viewed

	@@ -0,0 +1,178 @@

+# Architecture — AI Response Validator
+## What this is
+A domain-agnostic RAG evaluation system that validates AI responses for correctness,
+faithfulness, and client-specific terminology. Built as a portfolio demonstration of
+eval-driven architecture applied to production AI systems.
+**Core claim:** no single metric proves correctness. The combination does.
+---
+## System overview
+```
+USER QUERY + CLIENT SELECTION
+           │
+           ▼
+    ┌─────────────┐
+    │   FastAPI   │  /query endpoint
+    └──────┬──────┘
+           │
+           ▼
+    ┌─────────────────────────────────────┐
+    │           pipeline.run()            │
+    │                                     │
+    │  1. retrieve()                      │
+    │     cosine search over KB index     │
+    │     sentence-transformers embed     │
+    │     top-3 docs returned             │
+    │                                     │
+    │  2. _generate()                     │
+    │     context injected into prompt    │
+    │     Claude Haiku generates answer   │
+    │                                     │
+    │  3. grade()                         │
+    │     5 L1 metrics run in sequence    │
+    └─────────────────────────────────────┘
+           │
+           ▼
+    ┌─────────────────────────────────────┐
+    │         GradeReport.summary()       │
+    │  {overall_pass, metrics: {…}}       │
+    └─────────────────────────────────────┘
+           │
+           ▼
+      JSON response → UI eval panel
+```
+---
+## Two-layer evaluation
+### L1 — Live (every query, ~1-2s overhead)
+Runs inline with every request. No ground truth required.
+| Metric | Method | Threshold | Rationale |
+|--------|--------|-----------|-----------|
+| `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate — fails hard |
+| `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
+| `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
+| `faithfulness` | Claude Haiku judge (JSON output) | ≥ 0.70 | Hallucination detection |
+| `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
+### L2 — Batch (local, against golden dataset)
+```bash
+python eval/metrics.py --domain retail
+python eval/metrics.py --client novamart --out results.json
+```
+Runs all 20 golden pairs through the full pipeline. Adds keyphrase coverage scoring
+on top of L1 metrics to verify factual completeness against reference answers.
+---
+## Key design decisions
+### RosettaStone pattern
+Each domain has a canonical term vocabulary (`STOCK_CHECK`, `DRUG_APPROVAL`, etc.).
+Each client maps these to their own terminology. The bot must speak the client's
+language, not the canonical internal names.
+```
+STOCK_CHECK → "availability scan"   (NovaMart)
+STOCK_CHECK → "stock check"         (ShelfWise)
+```
+`check_terminology()` is deterministic — zero latency, no LLM, no false negatives.
+It flags rival-client terms appearing without the correct client term.
+**Why this matters:** in production multi-tenant AI systems, terminology leakage
+between clients is a real failure mode. This catches it mechanically.
+### Faithfulness via Claude-as-judge
+The faithfulness grader calls Claude Haiku with a structured prompt and expects
+JSON output: `{faithful, score, unsupported_claims}`. This is the LLM-as-judge
+pattern — using a fast, cheap model to evaluate a slower, more capable model's output.
+**Tradeoff accepted:** adds ~0.5s latency and API cost per query. Alternative
+(NLI-based local model) would be faster but less accurate for open-domain claims.
+### In-memory semantic retrieval
+KB documents are encoded once per domain at first query and cached in a module-level
+dict. Cosine search at query time. No vector database, no persistent storage.
+**Why no vector DB:** the KB is small (8-9 docs per domain). A vector DB would add
+operational complexity with zero retrieval quality benefit at this scale.
+**Tradeoff accepted:** KB updates require a process restart to invalidate the cache.
+Acceptable for a demo; production would add a cache invalidation signal.
+### Why two evaluation layers
+L1 catches structural failures (PII, irrelevance, hallucination) instantly, on every
+query. L2 catches factual gaps by comparing against reference answers — requires
+ground truth so it can't run live.
+Running L2 on every query would add 30+ seconds of latency (LLM reverse-question
+generation, per-chunk precision scoring). The two-layer split is a deliberate
+latency/depth tradeoff.
+---
+## Repository structure
+```
+backend/
+  app.py          FastAPI app — endpoints, lifespan, static file serving
+  pipeline.py     Orchestrator — retrieve → generate → grade
+  grader.py       L1 metric implementations + GradeReport dataclass
+  rosetta.py      RosettaStone — canonical ↔ client term translation
+  config.py       Domain/client registry, shared constants
+knowledge/
+  retail/
+    term-catalog.yaml   Canonical → client term mappings (NovaMart, ShelfWise)
+    features.yaml       KB documents for retrieval
+  pharma/
+    term-catalog.yaml   Canonical → client term mappings (ClinixOne, PharmaLink)
+    features.yaml       KB documents for retrieval
+eval/
+  golden-dataset.yaml   20 Q&A pairs (10 retail, 10 pharma) for L2 evaluation
+  metrics.py            L2 batch runner — CLI, keyphrase scoring, summary report
+ui/
+  index.html    Chat interface + eval panel
+  app.js        Domain/client switcher, message rendering, metric cards
+```
+---
+## Deliberate tradeoffs
+| Decision | Alternative | Why this |
+|----------|-------------|----------|
+| Claude Haiku for faithfulness | Local NLI model (DeBERTa) | Simpler infra, better accuracy on open domain |
+| In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
+| Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
+| Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
+| Plain HTML/JS frontend | React/Next.js | No build step — deploys as static files |
+---
+## Evaluation coverage vs RAGAS
+| RAGAS metric | Coverage |
+|---|---|
+| faithfulness | ✓ L1 (Claude judge) |
+| answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) |
+| context_precision | partial — retrieval score visible in UI |
+| context_recall | ✓ L2 (keyphrase coverage) |
+| answer_correctness | ✓ L2 (keyphrase + expected_answer) |

eval/metrics.py CHANGED Viewed

	@@ -0,0 +1,134 @@

+"""
+L2 batch evaluation — runs the golden dataset through the pipeline locally.
+Usage:
+    python eval/metrics.py                        # all pairs
+    python eval/metrics.py --domain retail        # one domain
+    python eval/metrics.py --client novamart      # one client
+    python eval/metrics.py --out results.json     # write output
+Requires ANTHROPIC_API_KEY in environment.
+Reads golden-dataset.yaml, calls pipeline.run() per pair, scores with grader,
+and prints a per-metric summary table.
+"""
+import argparse
+import json
+import logging
+import os
+import sys
+from pathlib import Path
+import yaml
+sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
+import anthropic
+from pipeline import run
+log = logging.getLogger(__name__)
+logging.basicConfig(level=logging.INFO, format="%(message)s")
+DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
+def load_pairs(domain: str | None = None, client: str | None = None) -> list[dict]:
+    data = yaml.safe_load(DATASET_PATH.read_text())
+    pairs = data["pairs"]
+    if domain:
+        pairs = [p for p in pairs if p["domain"] == domain]
+    if client:
+        pairs = [p for p in pairs if p["client"] == client]
+    return pairs
+def score_pair(pair: dict, anthropic_client: anthropic.Anthropic) -> dict:
+    """Run one golden pair through the pipeline and return scored result."""
+    result = run(
+        query=pair["question"],
+        client=pair["client"],
+        anthropic_client=anthropic_client,
+    )
+    payload = result.response_payload
+    metrics = payload["evaluation"]["metrics"]
+    # Keyphrase coverage: how many expected_contains appear in the answer
+    expected = pair.get("expected_contains", [])
+    answer_lower = result.answer.lower()
+    matched = [kw for kw in expected if kw.lower() in answer_lower]
+    keyphrase_coverage = len(matched) / len(expected) if expected else 1.0
+    return {
+        "id": pair["id"],
+        "client": pair["client"],
+        "domain": pair["domain"],
+        "question": pair["question"],
+        "answer": result.answer,
+        "keyphrase_coverage": round(keyphrase_coverage, 3),
+        "matched_keyphrases": matched,
+        "missing_keyphrases": [kw for kw in expected if kw not in matched],
+        "metrics": metrics,
+        "overall_pass": payload["evaluation"]["overall_pass"],
+        "sources": [s["title"] for s in payload["sources"]],
+    }
+def print_summary(results: list[dict]) -> None:
+    metric_names = list(results[0]["metrics"].keys()) if results else []
+    total = len(results)
+    passed = sum(1 for r in results if r["overall_pass"])
+    log.info("\n── Summary ─────────────────────────────────────")
+    log.info("Pairs evaluated : %d", total)
+    log.info("Overall pass    : %d / %d (%.0f%%)", passed, total, 100 * passed / total if total else 0)
+    log.info("\n── Per-metric pass rate ────────────────────────")
+    for name in metric_names:
+        n_pass = sum(1 for r in results if r["metrics"][name]["passed"])
+        avg_score = sum(r["metrics"][name]["score"] for r in results) / total if total else 0
+        log.info("  %-22s %d/%d  avg %.2f", name, n_pass, total, avg_score)
+    log.info("\n── Keyphrase coverage ──────────────────────────")
+    avg_cov = sum(r["keyphrase_coverage"] for r in results) / total if total else 0
+    log.info("  Average coverage: %.0f%%", avg_cov * 100)
+    failures = [r for r in results if not r["overall_pass"]]
+    if failures:
+        log.info("\n── Failed pairs ────────────────────────────────")
+        for r in failures:
+            failed_metrics = [m for m, v in r["metrics"].items() if not v["passed"]]
+            log.info("  [%s] %s", r["id"], ", ".join(failed_metrics))
+def main() -> None:
+    parser = argparse.ArgumentParser(description="L2 batch evaluation against golden dataset")
+    parser.add_argument("--domain", help="Filter by domain (retail|pharma)")
+    parser.add_argument("--client", help="Filter by client id")
+    parser.add_argument("--out", help="Write full results to JSON file")
+    args = parser.parse_args()
+    api_key = os.environ.get("ANTHROPIC_API_KEY")
+    if not api_key:
+        sys.exit("ANTHROPIC_API_KEY not set")
+    anthropic_client = anthropic.Anthropic(api_key=api_key)
+    pairs = load_pairs(domain=args.domain, client=args.client)
+    if not pairs:
+        sys.exit("No pairs matched the given filters")
+    log.info("Evaluating %d pairs...", len(pairs))
+    results = []
+    for i, pair in enumerate(pairs, 1):
+        log.info("[%d/%d] %s", i, len(pairs), pair["id"])
+        results.append(score_pair(pair, anthropic_client))
+    print_summary(results)
+    if args.out:
+        Path(args.out).write_text(json.dumps(results, indent=2))
+        log.info("\nResults written to %s", args.out)
+if __name__ == "__main__":
+    main()