mbochniak01 committed on
Commit · f748b3d
1 Parent(s): ebe934f
Add L2 batch evaluator and architecture documentation
- ARCHITECTURE.md +178 -0
- eval/metrics.py +134 -0
ARCHITECTURE.md
@@ -0,0 +1,178 @@
# Architecture - AI Response Validator

## What this is

A domain-agnostic RAG evaluation system that validates AI responses for correctness,
faithfulness, and client-specific terminology. Built as a portfolio demonstration of
eval-driven architecture applied to production AI systems.

**Core claim:** no single metric proves correctness. The combination does.

---

## System overview

```
USER QUERY + CLIENT SELECTION
         │
         ▼
  ┌─────────────┐
  │   FastAPI   │  /query endpoint
  └──────┬──────┘
         │
         ▼
  ┌─────────────────────────────────────┐
  │  pipeline.run()                     │
  │                                     │
  │  1. retrieve()                      │
  │     cosine search over KB index     │
  │     sentence-transformers embed     │
  │     top-3 docs returned             │
  │                                     │
  │  2. _generate()                     │
  │     context injected into prompt    │
  │     Claude Haiku generates answer   │
  │                                     │
  │  3. grade()                         │
  │     5 L1 metrics run in sequence    │
  └─────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────┐
  │  GradeReport.summary()              │
  │  {overall_pass, metrics: {…}}       │
  └─────────────────────────────────────┘
         │
         ▼
  JSON response → UI eval panel
```
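
The final box in the diagram can be sketched as a small dataclass. This is a minimal illustration, not the code in `grader.py`: the `MetricResult` class, its field names, and the rule that `overall_pass` requires every metric to pass are assumptions, chosen to match the `{overall_pass, metrics: {…}}` shape shown above.

```python
from dataclasses import dataclass, field


@dataclass
class MetricResult:
    """One L1 metric outcome (hypothetical shape)."""
    name: str
    score: float
    passed: bool


@dataclass
class GradeReport:
    """Collects metric results. Assumption: the report passes only if every metric passes."""
    results: list[MetricResult] = field(default_factory=list)

    @property
    def overall_pass(self) -> bool:
        return all(r.passed for r in self.results)

    def summary(self) -> dict:
        # Mirrors the {overall_pass, metrics: {...}} payload from the diagram.
        return {
            "overall_pass": self.overall_pass,
            "metrics": {
                r.name: {"score": r.score, "passed": r.passed} for r in self.results
            },
        }


report = GradeReport([
    MetricResult("pii_leakage", 1.0, True),
    MetricResult("answer_relevancy", 0.31, False),
])
print(report.summary()["overall_pass"])  # False, because one metric failed
```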

---

## Two-layer evaluation

### L1 - Live (every query, ~1-2s overhead)

Runs inline with every request. No ground truth required.

| Metric | Method | Threshold | Rationale |
|--------|--------|-----------|-----------|
| `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate; fails hard |
| `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
| `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
| `faithfulness` | Claude Haiku judge (JSON output) | ≥ 0.70 | Hallucination detection |
| `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
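
The two cheapest gates in the table are plain string checks. The sketch below shows roughly how they could look. The regexes, function names, and the use of integer division are assumptions for illustration; the actual patterns in `grader.py` may differ and also cover phone and card numbers.

```python
import re

# Illustrative patterns only (assumed); not the project's real regexes.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}
TOKEN_BUDGET = 512  # threshold from the table


def estimate_tokens(text: str) -> int:
    """Approximate the token count as characters divided by 4."""
    return len(text) // 4


def check_token_budget(text: str) -> bool:
    """True when the estimated token count is within the budget."""
    return estimate_tokens(text) <= TOKEN_BUDGET


def find_pii(text: str) -> list[str]:
    """Return the names of the PII patterns that match the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


answer = "Contact jane.doe@example.com for the stock report."
print(find_pii(answer))            # ['email']
print(check_token_budget(answer))  # True
```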

### L2 - Batch (local, against golden dataset)

```bash
python eval/metrics.py --domain retail
python eval/metrics.py --client novamart --out results.json
```

Runs the golden pairs (20 in total, or the subset selected by `--domain` / `--client`)
through the full pipeline. Adds keyphrase coverage scoring
on top of L1 metrics to verify factual completeness against reference answers.
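
As a worked example of keyphrase coverage: the function below follows the substring-matching logic in `eval/metrics.py`, while the golden pair and the answer are made-up values used only to show the arithmetic (2 of 3 phrases found).

```python
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected phrases found in the answer (case-insensitive substring match)."""
    if not expected:
        return 1.0  # nothing to check counts as full coverage
    answer_lower = answer.lower()
    matched = [kw for kw in expected if kw.lower() in answer_lower]
    return len(matched) / len(expected)


# Hypothetical golden pair: field names follow eval/metrics.py, values are invented.
pair = {
    "id": "retail-001",
    "question": "How do I check availability?",
    "expected_contains": ["availability scan", "store", "SKU"],
}
answer = "Run an availability scan for the SKU from the product page."
print(round(keyphrase_coverage(answer, pair["expected_contains"]), 3))  # 0.667
```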

---

## Key design decisions

### RosettaStone pattern

Each domain has a canonical term vocabulary (`STOCK_CHECK`, `DRUG_APPROVAL`, etc.).
Each client maps these to their own terminology. The bot must speak the client's
language, not the canonical internal names.

```
STOCK_CHECK → "availability scan"   (NovaMart)
STOCK_CHECK → "stock check"         (ShelfWise)
```

`check_terminology()` is deterministic: a plain lookup with negligible latency, no LLM
call, and no false negatives for terms listed in the catalog.
It flags rival-client terms appearing without the correct client term.

**Why this matters:** in production multi-tenant AI systems, terminology leakage
between clients is a real failure mode. This catches it mechanically.
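
A deterministic check of this kind might look like the sketch below. The catalog layout, the function signature, and the choice to skip concepts the client has no mapping for are all assumptions; the real `check_terminology()` in `rosetta.py` may be structured differently.

```python
# Hypothetical term catalog: canonical term -> {client id: client wording}.
TERM_CATALOG = {
    "STOCK_CHECK": {"novamart": "availability scan", "shelfwise": "stock check"},
}


def check_terminology(answer: str, client: str) -> list[str]:
    """Return rival-client terms that appear while the client's own term is absent."""
    text = answer.lower()
    violations = []
    for mapping in TERM_CATALOG.values():
        own_term = mapping.get(client)
        if own_term is None or own_term in text:
            continue  # concept unmapped for this client, or the correct wording is present
        for other_client, rival_term in mapping.items():
            if other_client != client and rival_term != own_term and rival_term in text:
                violations.append(rival_term)
    return violations


print(check_terminology("Run a stock check on aisle 4.", "novamart"))         # ['stock check']
print(check_terminology("Run an availability scan on aisle 4.", "novamart"))  # []
```

Because the check is a dictionary lookup plus substring tests, the same input always yields the same violations, which is what makes the result auditable.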

### Faithfulness via Claude-as-judge

The faithfulness grader calls Claude Haiku with a structured prompt and expects
JSON output: `{faithful, score, unsupported_claims}`. This is the LLM-as-judge
pattern: a fast, cheap model grades the generated answer (here also produced by
Claude Haiku) for claims the retrieved context does not support.

**Tradeoff accepted:** adds ~0.5s latency and API cost per query. Alternative
(NLI-based local model) would be faster but less accurate for open-domain claims.
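
Since the judge replies in free text that is only expected to be JSON, the grader needs a defensive parsing step. The sketch below shows one way to handle it without calling the API; the function name and the decision to treat an unparseable reply as a failed check are assumptions, not the behavior confirmed for `grader.py`.

```python
import json

FAITHFULNESS_THRESHOLD = 0.70  # from the L1 metric table


def parse_judge_reply(raw: str) -> dict:
    """Parse the judge's JSON reply and apply the pass threshold.

    Assumption: malformed or incomplete replies count as a failed check
    instead of raising an exception.
    """
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": 0.0, "passed": False, "unsupported_claims": [], "error": "unparseable"}
    return {
        "score": score,
        "passed": score >= FAITHFULNESS_THRESHOLD,
        "unsupported_claims": list(data.get("unsupported_claims") or []),
    }


reply = '{"faithful": false, "score": 0.5, "unsupported_claims": ["ships in 2 days"]}'
print(parse_judge_reply(reply)["passed"])       # False
print(parse_judge_reply("not json")["passed"])  # False
```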

### In-memory semantic retrieval

KB documents are encoded once per domain at first query and cached in a module-level
dict. Cosine search at query time. No vector database, no persistent storage.

**Why no vector DB:** the KB is small (8-9 docs per domain). A vector DB would add
operational complexity with zero retrieval quality benefit at this scale.

**Tradeoff accepted:** KB updates require a process restart to invalidate the cache.
Acceptable for a demo; production would add a cache invalidation signal.

### Why two evaluation layers

L1 catches structural failures (PII, irrelevance, hallucination) instantly, on every
query. L2 catches factual gaps by comparing against reference answers; it requires
ground truth, so it can't run live.

Running L2 on every query would add 30+ seconds of latency (LLM reverse-question
generation, per-chunk precision scoring). The two-layer split is a deliberate
latency/depth tradeoff.

---

## Repository structure

```
backend/
  app.py                FastAPI app: endpoints, lifespan, static file serving
  pipeline.py           Orchestrator: retrieve → generate → grade
  grader.py             L1 metric implementations + GradeReport dataclass
  rosetta.py            RosettaStone: canonical → client term translation
  config.py             Domain/client registry, shared constants

knowledge/
  retail/
    term-catalog.yaml   Canonical → client term mappings (NovaMart, ShelfWise)
    features.yaml       KB documents for retrieval
  pharma/
    term-catalog.yaml   Canonical → client term mappings (ClinixOne, PharmaLink)
    features.yaml       KB documents for retrieval

eval/
  golden-dataset.yaml   20 Q&A pairs (10 retail, 10 pharma) for L2 evaluation
  metrics.py            L2 batch runner: CLI, keyphrase scoring, summary report

ui/
  index.html            Chat interface + eval panel
  app.js                Domain/client switcher, message rendering, metric cards
```

---

## Deliberate tradeoffs

| Decision | Alternative | Why this |
|----------|-------------|----------|
| Claude Haiku for faithfulness | Local NLI model (DeBERTa) | Simpler infra, better accuracy on open domain |
| In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
| Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
| Deterministic terminology check | LLM terminology judge | Negligible latency, no false negatives for cataloged terms, auditable |
| Plain HTML/JS frontend | React/Next.js | No build step; deploys as static files |

---

## Evaluation coverage vs RAGAS

| RAGAS metric | Coverage |
|---|---|
| faithfulness | ✓ L1 (Claude judge) |
| answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) |
| context_precision | partial: retrieval score visible in UI |
| context_recall | ✓ L2 (keyphrase coverage) |
| answer_correctness | ✓ L2 (keyphrase + expected_answer) |

eval/metrics.py
@@ -0,0 +1,134 @@
"""
L2 batch evaluation: runs the golden dataset through the pipeline locally.

Usage:
    python eval/metrics.py                        # all pairs
    python eval/metrics.py --domain retail        # one domain
    python eval/metrics.py --client novamart      # one client
    python eval/metrics.py --out results.json     # write output

Requires ANTHROPIC_API_KEY in environment.
Reads golden-dataset.yaml, calls pipeline.run() per pair, scores with grader,
and prints a per-metric summary table.
"""

import argparse
import json
import logging
import os
import sys
from pathlib import Path

import yaml

# Make backend/ importable so that `pipeline` resolves when run as a script.
sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))

import anthropic
from pipeline import run

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")

DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"


def load_pairs(domain: str | None = None, client: str | None = None) -> list[dict]:
    """Load golden pairs, optionally filtered by domain and/or client."""
    data = yaml.safe_load(DATASET_PATH.read_text(encoding="utf-8"))
    pairs = data["pairs"]
    if domain:
        pairs = [p for p in pairs if p["domain"] == domain]
    if client:
        pairs = [p for p in pairs if p["client"] == client]
    return pairs


def score_pair(pair: dict, anthropic_client: anthropic.Anthropic) -> dict:
    """Run one golden pair through the pipeline and return scored result."""
    result = run(
        query=pair["question"],
        client=pair["client"],
        anthropic_client=anthropic_client,
    )
    payload = result.response_payload
    metrics = payload["evaluation"]["metrics"]

    # Keyphrase coverage: how many expected_contains appear in the answer
    expected = pair.get("expected_contains", [])
    answer_lower = result.answer.lower()
    matched = [kw for kw in expected if kw.lower() in answer_lower]
    keyphrase_coverage = len(matched) / len(expected) if expected else 1.0

    return {
        "id": pair["id"],
        "client": pair["client"],
        "domain": pair["domain"],
        "question": pair["question"],
        "answer": result.answer,
        "keyphrase_coverage": round(keyphrase_coverage, 3),
        "matched_keyphrases": matched,
        "missing_keyphrases": [kw for kw in expected if kw not in matched],
        "metrics": metrics,
        "overall_pass": payload["evaluation"]["overall_pass"],
        "sources": [s["title"] for s in payload["sources"]],
    }


def print_summary(results: list[dict]) -> None:
    """Log overall pass rate, per-metric pass rates, keyphrase coverage, and failures."""
    metric_names = list(results[0]["metrics"].keys()) if results else []
    total = len(results)
    passed = sum(1 for r in results if r["overall_pass"])

    log.info("\n── Summary ─────────────────────────────────────")
    log.info("Pairs evaluated : %d", total)
    log.info("Overall pass    : %d / %d (%.0f%%)", passed, total, 100 * passed / total if total else 0)

    log.info("\n── Per-metric pass rate ────────────────────────")
    for name in metric_names:
        n_pass = sum(1 for r in results if r["metrics"][name]["passed"])
        avg_score = sum(r["metrics"][name]["score"] for r in results) / total if total else 0
        log.info("  %-22s %d/%d  avg %.2f", name, n_pass, total, avg_score)

    log.info("\n── Keyphrase coverage ──────────────────────────")
    avg_cov = sum(r["keyphrase_coverage"] for r in results) / total if total else 0
    log.info("  Average coverage: %.0f%%", avg_cov * 100)

    failures = [r for r in results if not r["overall_pass"]]
    if failures:
        log.info("\n── Failed pairs ────────────────────────────────")
        for r in failures:
            failed_metrics = [m for m, v in r["metrics"].items() if not v["passed"]]
            log.info("  [%s] %s", r["id"], ", ".join(failed_metrics))


def main() -> None:
    """CLI entry point: parse filters, evaluate matching pairs, report results."""
    parser = argparse.ArgumentParser(description="L2 batch evaluation against golden dataset")
    parser.add_argument("--domain", help="Filter by domain (retail|pharma)")
    parser.add_argument("--client", help="Filter by client id")
    parser.add_argument("--out", help="Write full results to JSON file")
    args = parser.parse_args()

    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        sys.exit("ANTHROPIC_API_KEY not set")

    anthropic_client = anthropic.Anthropic(api_key=api_key)
    pairs = load_pairs(domain=args.domain, client=args.client)

    if not pairs:
        sys.exit("No pairs matched the given filters")

    log.info("Evaluating %d pairs...", len(pairs))
    results = []
    for i, pair in enumerate(pairs, 1):
        log.info("[%d/%d] %s", i, len(pairs), pair["id"])
        results.append(score_pair(pair, anthropic_client))

    print_summary(results)

    if args.out:
        Path(args.out).write_text(json.dumps(results, indent=2))
        log.info("\nResults written to %s", args.out)


if __name__ == "__main__":
    main()