mbochniak01 committed on
Commit
f748b3d
·
1 Parent(s): ebe934f

Add L2 batch evaluator and architecture documentation

Files changed (2)
  1. ARCHITECTURE.md +178 -0
  2. eval/metrics.py +134 -0
ARCHITECTURE.md CHANGED
@@ -0,0 +1,178 @@
+ # Architecture: AI Response Validator
+
+ ## What this is
+
+ A domain-agnostic RAG evaluation system that validates AI responses for correctness,
+ faithfulness, and client-specific terminology. Built as a portfolio demonstration of
+ eval-driven architecture applied to production AI systems.
+
+ **Core claim:** no single metric proves correctness. The combination does.
+
+ ---
+
+ ## System overview
+
+ ```
+      USER QUERY + CLIENT SELECTION
+                    │
+                    ▼
+             ┌─────────────┐
+             │   FastAPI   │  /query endpoint
+             └──────┬──────┘
+                    │
+                    ▼
+ ┌─────────────────────────────────────┐
+ │  pipeline.run()                     │
+ │                                     │
+ │  1. retrieve()                      │
+ │     cosine search over KB index     │
+ │     sentence-transformers embed     │
+ │     top-3 docs returned             │
+ │                                     │
+ │  2. _generate()                     │
+ │     context injected into prompt    │
+ │     Claude Haiku generates answer   │
+ │                                     │
+ │  3. grade()                         │
+ │     5 L1 metrics run in sequence    │
+ └─────────────────────────────────────┘
+                    │
+                    ▼
+ ┌─────────────────────────────────────┐
+ │  GradeReport.summary()              │
+ │  {overall_pass, metrics: {…}}       │
+ └─────────────────────────────────────┘
+                    │
+                    ▼
+      JSON response → UI eval panel
+ ```
+
+ ---
+
+ ## Two-layer evaluation
+
+ ### L1: Live (every query, ~1-2s overhead)
+
+ Runs inline with every request. No ground truth required.
+
+ | Metric | Method | Threshold | Rationale |
+ |--------|--------|-----------|-----------|
+ | `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate; fails hard |
+ | `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
+ | `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
+ | `faithfulness` | Claude Haiku judge (JSON output) | ≥ 0.70 | Hallucination detection |
+ | `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
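
A minimal sketch of the two cheapest L1 checks in the table above, `token_budget` and `pii_leakage`. The helper names and the two regexes are hypothetical stand-ins, not the contents of `grader.py`; only the chars ÷ 4 estimate and the 512-token threshold come from the table.

```python
import re

# Hypothetical patterns for illustration only; the real grader's regexes
# (SSN, email, phone, card) are not shown in this commit.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

TOKEN_BUDGET = 512  # threshold from the L1 metrics table


def estimate_tokens(text: str) -> int:
    """Approximate the token count as characters divided by 4."""
    return len(text) // 4


def within_token_budget(text: str, budget: int = TOKEN_BUDGET) -> bool:
    """Pass when the estimated token count does not exceed the budget."""
    return estimate_tokens(text) <= budget


def find_pii(text: str) -> list[str]:
    """Return the names of the PII patterns that match the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```

Under these assumptions a 2,048-character answer sits exactly on the budget, and any non-empty result from `find_pii()` would fail the binary safety gate.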
+
+ ### L2: Batch (local, against golden dataset)
+
+ ```bash
+ python eval/metrics.py --domain retail
+ python eval/metrics.py --client novamart --out results.json
+ ```
+
+ Runs the golden pairs (20 in total, or the subset selected by `--domain` or `--client`)
+ through the full pipeline. Adds keyphrase coverage scoring on top of the L1 metrics to
+ check factual completeness against each pair's `expected_contains` phrases.
+
+ ---
+
+ ## Key design decisions
+
+ ### RosettaStone pattern
+
+ Each domain has a canonical term vocabulary (`STOCK_CHECK`, `DRUG_APPROVAL`, etc.).
+ Each client maps these to its own terminology. The bot must speak the client's
+ language, not the canonical internal names.
+
+ ```
+ STOCK_CHECK → "availability scan"   (NovaMart)
+ STOCK_CHECK → "stock check"         (ShelfWise)
+ ```
+
+ `check_terminology()` is a deterministic lookup: negligible latency, no LLM call, and
+ no false negatives for terms listed in the catalog. It flags rival-client terms that
+ appear without the correct client term.
+
+ **Why this matters:** in production multi-tenant AI systems, terminology leakage
+ between clients is a real failure mode. This catches it mechanically.
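
The rule described above (flag a rival client's term when the client's own term is absent) can be sketched as follows. This is an illustration under assumptions: the catalog shape and the function name `find_terminology_violations` are hypothetical and do not reproduce `rosetta.py` or `check_terminology()`.

```python
# Hypothetical catalog shape: canonical term -> {client id: client term}.
# The two mappings are the ones quoted in the section above.
TERM_CATALOG = {
    "STOCK_CHECK": {"novamart": "availability scan", "shelfwise": "stock check"},
}


def find_terminology_violations(answer: str, client: str) -> list[str]:
    """Return rival-client terms that appear without the client's own term."""
    text = answer.lower()
    violations = []
    for terms in TERM_CATALOG.values():
        own = terms.get(client)
        if own is None or own.lower() in text:
            continue  # nothing to enforce, or the correct term is present
        for rival, rival_term in terms.items():
            if rival != client and rival_term.lower() in text:
                violations.append(rival_term)
    return violations
```

For a NovaMart session, an answer that says "stock check" without "availability scan" would be flagged; the same sentence in a ShelfWise session would pass.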
+
+ ### Faithfulness via Claude-as-judge
+
+ The faithfulness grader calls Claude Haiku with a structured prompt and expects
+ JSON output: `{faithful, score, unsupported_claims}`. This is the LLM-as-judge
+ pattern, which typically uses a fast, cheap model to evaluate another model's output;
+ here Claude Haiku both generates the answer and judges it.
+
+ **Tradeoff accepted:** adds ~0.5s latency and API cost per query. The alternative,
+ a local NLI-based model, would be faster but less accurate for open-domain claims.
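
One way the judge's JSON reply might be turned into a pass/fail result is sketched below. The function name and the fail-closed handling of malformed output are assumptions for illustration; only the `{faithful, score, unsupported_claims}` shape and the 0.70 threshold come from this document.

```python
import json

FAITHFULNESS_THRESHOLD = 0.70  # from the L1 metrics table


def parse_judge_reply(raw: str) -> dict:
    """Parse the judge's JSON reply; treat malformed output as a failure."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Assumption: an unparseable reply fails closed instead of passing.
        return {"passed": False, "score": 0.0, "unsupported_claims": [], "error": "unparseable"}
    return {
        "passed": score >= FAITHFULNESS_THRESHOLD,
        "score": score,
        "unsupported_claims": list(data.get("unsupported_claims") or []),
    }
```

Failing closed keeps a judge outage or a truncated reply from silently marking answers as faithful.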
+
+ ### In-memory semantic retrieval
+
+ KB documents are encoded once per domain at first query and cached in a module-level
+ dict. Each query is matched by cosine search against the cached embeddings. No vector
+ database, no persistent storage.
+
+ **Why no vector DB:** the KB is small (8-9 docs per domain). A vector DB would add
+ operational complexity with no retrieval-quality benefit at this scale.
+
+ **Tradeoff accepted:** KB updates require a process restart to invalidate the cache.
+ Acceptable for a demo; production would add a cache invalidation signal.
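
The encode-once, search-in-memory idea can be sketched with the standard library alone. This is not the pipeline's `retrieve()`: the signature, the `_INDEX_CACHE` name, and the injected `embed` callable (standing in for a sentence-transformers encoder) are assumptions made so the example runs without dependencies.

```python
import math
from typing import Callable

Vector = list[float]

# Module-level cache: domain -> list of (doc, embedding), filled on first query.
_INDEX_CACHE: dict[str, list[tuple[str, Vector]]] = {}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: str, domain: str, docs: list[str],
             embed: Callable[[str], Vector], top_k: int = 3) -> list[str]:
    """Return the top_k docs most similar to the query, encoding docs once per domain."""
    if domain not in _INDEX_CACHE:
        _INDEX_CACHE[domain] = [(doc, embed(doc)) for doc in docs]
    query_vec = embed(query)
    ranked = sorted(_INDEX_CACHE[domain],
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Because the cache is keyed only by domain, a changed document list is ignored until the process restarts, which is the tradeoff noted above.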
+
+ ### Why two evaluation layers
+
+ L1 catches structural failures (PII, irrelevance, hallucination) inline, on every
+ query. L2 catches factual gaps by comparing against reference answers; it requires
+ ground truth, so it can't run live.
+
+ Running L2 on every query would add 30+ seconds of latency (LLM reverse-question
+ generation, per-chunk precision scoring). The two-layer split is a deliberate
+ latency/depth tradeoff.
+
+ ---
+
+ ## Repository structure
+
+ ```
+ backend/
+   app.py              FastAPI app: endpoints, lifespan, static file serving
+   pipeline.py         Orchestrator: retrieve → generate → grade
+   grader.py           L1 metric implementations + GradeReport dataclass
+   rosetta.py          RosettaStone: canonical ↔ client term translation
+   config.py           Domain/client registry, shared constants
+
+ knowledge/
+   retail/
+     term-catalog.yaml   Canonical → client term mappings (NovaMart, ShelfWise)
+     features.yaml       KB documents for retrieval
+   pharma/
+     term-catalog.yaml   Canonical → client term mappings (ClinixOne, PharmaLink)
+     features.yaml       KB documents for retrieval
+
+ eval/
+   golden-dataset.yaml   20 Q&A pairs (10 retail, 10 pharma) for L2 evaluation
+   metrics.py            L2 batch runner: CLI, keyphrase scoring, summary report
+
+ ui/
+   index.html          Chat interface + eval panel
+   app.js              Domain/client switcher, message rendering, metric cards
+ ```
+
+ ---
+
+ ## Deliberate tradeoffs
+
+ | Decision | Alternative | Why this |
+ |----------|-------------|----------|
+ | Claude Haiku for faithfulness | Local NLI model (DeBERTa) | Simpler infra, better accuracy on open-domain claims |
+ | In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
+ | Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
+ | Deterministic terminology check | LLM terminology judge | Negligible latency, no false negatives on cataloged terms, auditable |
+ | Plain HTML/JS frontend | React/Next.js | No build step; deploys as static files |
+
+ ---
+
+ ## Evaluation coverage vs RAGAS
+
+ | RAGAS metric | Coverage |
+ |---|---|
+ | faithfulness | ✓ L1 (Claude judge) |
+ | answer_relevancy | ✓ L1 (cosine) + L2 (keyphrase) |
+ | context_precision | partial: retrieval score visible in UI |
+ | context_recall | ✓ L2 (keyphrase coverage) |
+ | answer_correctness | ✓ L2 (keyphrase + expected_answer) |
eval/metrics.py CHANGED
@@ -0,0 +1,134 @@
+ """
+ L2 batch evaluation: runs the golden dataset through the pipeline locally.
+
+ Usage:
+     python eval/metrics.py                     # all pairs
+     python eval/metrics.py --domain retail     # one domain
+     python eval/metrics.py --client novamart   # one client
+     python eval/metrics.py --out results.json  # write output
+
+ Requires ANTHROPIC_API_KEY in environment.
+ Reads golden-dataset.yaml, calls pipeline.run() per pair, scores with grader,
+ and prints a per-metric summary table.
+ """
+
+ import argparse
+ import json
+ import logging
+ import os
+ import sys
+ from pathlib import Path
+
+ import yaml
+
+ # Make the backend modules importable when this file is run as a script.
+ sys.path.insert(0, str(Path(__file__).parent.parent / "backend"))
+
+ import anthropic
+ from pipeline import run
+
+ log = logging.getLogger(__name__)
+ logging.basicConfig(level=logging.INFO, format="%(message)s")
+
+ DATASET_PATH = Path(__file__).parent / "golden-dataset.yaml"
+
+
+ def load_pairs(domain: str | None = None, client: str | None = None) -> list[dict]:
+     """Load golden pairs, optionally filtered by domain and/or client id."""
+     data = yaml.safe_load(DATASET_PATH.read_text(encoding="utf-8"))
+     pairs = data["pairs"]
+     if domain:
+         pairs = [p for p in pairs if p["domain"] == domain]
+     if client:
+         pairs = [p for p in pairs if p["client"] == client]
+     return pairs
+
+
+ def score_pair(pair: dict, anthropic_client: anthropic.Anthropic) -> dict:
+     """Run one golden pair through the pipeline and return scored result."""
+     result = run(
+         query=pair["question"],
+         client=pair["client"],
+         anthropic_client=anthropic_client,
+     )
+     payload = result.response_payload
+     metrics = payload["evaluation"]["metrics"]
+
+     # Keyphrase coverage: fraction of expected_contains phrases found in the
+     # answer (case-insensitive substring match). With no expected phrases the
+     # coverage defaults to 1.0.
+     expected = pair.get("expected_contains", [])
+     answer_lower = result.answer.lower()
+     matched = [kw for kw in expected if kw.lower() in answer_lower]
+     keyphrase_coverage = len(matched) / len(expected) if expected else 1.0
+
+     return {
+         "id": pair["id"],
+         "client": pair["client"],
+         "domain": pair["domain"],
+         "question": pair["question"],
+         "answer": result.answer,
+         "keyphrase_coverage": round(keyphrase_coverage, 3),
+         "matched_keyphrases": matched,
+         "missing_keyphrases": [kw for kw in expected if kw not in matched],
+         "metrics": metrics,
+         "overall_pass": payload["evaluation"]["overall_pass"],
+         "sources": [s["title"] for s in payload["sources"]],
+     }
+
+
+ def print_summary(results: list[dict]) -> None:
+     """Log overall and per-metric pass rates, average coverage, and failed pairs."""
+     # Metric names are taken from the first result; all results share the same set.
+     metric_names = list(results[0]["metrics"].keys()) if results else []
+     total = len(results)
+     passed = sum(1 for r in results if r["overall_pass"])
+
+     log.info("\n── Summary ─────────────────────────────────────")
+     log.info("Pairs evaluated : %d", total)
+     log.info("Overall pass : %d / %d (%.0f%%)", passed, total, 100 * passed / total if total else 0)
+
+     log.info("\n── Per-metric pass rate ────────────────────────")
+     for name in metric_names:
+         n_pass = sum(1 for r in results if r["metrics"][name]["passed"])
+         avg_score = sum(r["metrics"][name]["score"] for r in results) / total if total else 0
+         log.info(" %-22s %d/%d avg %.2f", name, n_pass, total, avg_score)
+
+     log.info("\n── Keyphrase coverage ──────────────────────────")
+     avg_cov = sum(r["keyphrase_coverage"] for r in results) / total if total else 0
+     log.info(" Average coverage: %.0f%%", avg_cov * 100)
+
+     failures = [r for r in results if not r["overall_pass"]]
+     if failures:
+         log.info("\n── Failed pairs ────────────────────────────────")
+         for r in failures:
+             failed_metrics = [m for m, v in r["metrics"].items() if not v["passed"]]
+             log.info(" [%s] %s", r["id"], ", ".join(failed_metrics))
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description="L2 batch evaluation against golden dataset")
+     parser.add_argument("--domain", help="Filter by domain (retail|pharma)")
+     parser.add_argument("--client", help="Filter by client id")
+     parser.add_argument("--out", help="Write full results to JSON file")
+     args = parser.parse_args()
+
+     # Fail fast before any pipeline work if the API key is missing.
+     api_key = os.environ.get("ANTHROPIC_API_KEY")
+     if not api_key:
+         sys.exit("ANTHROPIC_API_KEY not set")
+
+     anthropic_client = anthropic.Anthropic(api_key=api_key)
+     pairs = load_pairs(domain=args.domain, client=args.client)
+
+     if not pairs:
+         sys.exit("No pairs matched the given filters")
+
+     log.info("Evaluating %d pairs...", len(pairs))
+     results = []
+     for i, pair in enumerate(pairs, 1):
+         log.info("[%d/%d] %s", i, len(pairs), pair["id"])
+         results.append(score_pair(pair, anthropic_client))
+
+     print_summary(results)
+
+     if args.out:
+         Path(args.out).write_text(json.dumps(results, indent=2), encoding="utf-8")
+         log.info("\nResults written to %s", args.out)
+
+
+ if __name__ == "__main__":
+     main()