# Design Notes

## Key decisions and tradeoffs

### API target: own implementation
Instead of wrapping a third-party fake API, the client wraps this project's own
FastAPI backend. This means the client and the API are co-designed: the typed
models on both sides stay in sync by design. The tradeoff: less realistic than
wrapping an external API you don't control, but the test surface is richer and
the integration tests verify real business logic, not just HTTP plumbing.

### Two-layer evaluation (L1 live / L2 batch)
L1 runs on every query inline (~1-2s overhead). L2 runs offline against a golden
dataset. The split is a deliberate latency/depth tradeoff: LLM-judged metrics
(contextual precision, reverse-question relevancy) add 30+ seconds per pair:
unacceptable live, fine in batch. The golden dataset is the contract; L2 is the
regression gate.

### Deterministic chain_terminology over LLM judge
The terminology check is a dict lookup, not a model call. Zero latency, zero cost,
zero false negatives on known mappings. The tradeoff: it only catches terms in the
catalog, so novel terminology drift goes undetected. An LLM judge would catch drift
but would introduce latency and non-determinism into a metric that must be auditable.
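
As a rough illustration of the deterministic approach, the sketch below checks text against a small catalog. The catalog entries and the function name `find_flagged_terms` are hypothetical and are not taken from the project; the real `chain_terminology` check may be structured differently.

```python
import re

# Hypothetical catalog: off-catalog term -> preferred term.
# These entries are illustrative only, not the project's real mappings.
TERM_CATALOG = {
    "shopping cart": "order basket",
    "client id": "account number",
}


def find_flagged_terms(text: str, catalog: dict = TERM_CATALOG) -> list:
    """Return (found_term, preferred_term) pairs for cataloged terms in text."""
    lowered = text.lower()
    flagged = []
    for term, preferred in catalog.items():
        # Word boundaries keep a term from matching inside a longer word.
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            flagged.append((term, preferred))
    return flagged
```

The same input always yields the same result and no model is called, which is what makes the metric auditable. A term missing from the catalog simply returns nothing, which is the blind spot described above.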

### In-memory retrieval over vector database
KB size is 8-9 docs per domain. Encoding them at startup and doing cosine search
at query time adds ~2ms retrieval overhead with no infrastructure dependency.
A vector DB (Chroma, pgvector) would add operational complexity with zero
retrieval quality gain at this scale.
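
A minimal sketch of the in-memory approach, assuming the documents have already been embedded. The class name `InMemoryIndex` and the toy two-dimensional vectors are hypothetical; the project presumably uses real sentence embeddings and may use vectorized math rather than plain Python lists.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class InMemoryIndex:
    """Holds normalized document vectors and ranks them against a query."""

    def __init__(self, doc_ids, embeddings):
        self.doc_ids = list(doc_ids)
        # Normalize once at startup; queries then only need dot products.
        self.vectors = [normalize(v) for v in embeddings]

    def search(self, query_vec, top_k=3):
        q = normalize(query_vec)
        scored = [
            (sum(a * b for a, b in zip(q, v)), doc_id)
            for v, doc_id in zip(self.vectors, self.doc_ids)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:top_k]]
```

With fewer than ten documents per domain, a linear scan like this touches every vector on each query, and that is still cheap enough that an approximate-nearest-neighbour index would buy nothing.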

### httpx + tenacity for the client
`httpx` is the modern alternative to `requests`: native async support if needed
later, cleaner timeout API, better type annotations. `tenacity` separates retry
policy from request logic cleanly β€” the retry decorator is readable and testable
independently from the HTTP code.

### Integration tests are read-only by design
The API has no mutable state: queries don't persist, no records are created or
deleted. Cleanup is therefore trivially satisfied: there is nothing to clean up.
This is called out explicitly because it's a deliberate architectural choice, not
an oversight. A stateful API (task creation, deletion) would require explicit
teardown fixtures.

---

## NLI model selection: what was tried and why

The faithfulness grader went through three configurations (two models) before converging:

**Vectara HHEM v2** (`vectara/hallucination_evaluation_model`): purpose-built for RAG
faithfulness, not general NLI. The correct model for this task. Unusable: the checkpoint
is missing `t5.transformer.encoder.embed_tokens.weight`. The embedding matrix is
zero-initialized (`std=0.0`), producing constant 0.502 probability for every input.
Diagnosed via weight inspection, not error message.

**`cross-encoder/nli-deberta-v3-small`** (first attempt, paragraph-level): 3-class NLI
(contradiction / entailment / neutral). Correct model family, wrong input format.
NLI cross-encoders are trained on sentence-pair inputs (SNLI/MNLI). Feeding a 3–4
sentence KB paragraph as the premise causes entailment scores to collapse: verbatim
text scores `ent=0.002`, treated as neutral. Root cause: model distributes probability
across longer sequences in ways not seen during training.

**`cross-encoder/nli-deberta-v3-small` (sentence-level)**: same model, fixed by splitting
KB chunks into individual sentences before scoring. Verbatim: `ent=0.995`. Aliased terms
("item registry" vs "product catalog (item registry)"): `ent=0.989`. Hallucinated facts:
`ent≈0.000`, contradiction≈1.0. This is the current implementation.

**Key insight:** the NLI model selection problem is a data format problem as much as a
model selection problem. The same model produces correct results at sentence level and
degenerate results at paragraph level.
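
The input-format fix can be sketched as follows. This is an assumption-laden outline, not the project's grader: the scorer is passed in as a plain function, `toy_entail` is a token-overlap stand-in for the cross-encoder, and taking the maximum over sentences is an assumed aggregation rule that the notes do not specify.

```python
import re
from typing import Callable

# (premise, hypothesis) -> entailment probability in [0, 1].
EntailmentFn = Callable[[str, str], float]


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def max_sentence_entailment(chunk: str, claim: str, entail: EntailmentFn) -> float:
    """Score the claim against each KB sentence separately; keep the best score."""
    sentences = split_sentences(chunk)
    if not sentences:
        return 0.0
    return max(entail(sentence, claim) for sentence in sentences)


def toy_entail(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: fraction of hypothesis tokens found in the premise."""
    p = set(re.findall(r"\w+", premise.lower()))
    h = set(re.findall(r"\w+", hypothesis.lower()))
    return len(p & h) / len(h) if h else 0.0
```

The toy scorer does not reproduce the paragraph-level collapse. It only shows the call shape: each premise handed to the scorer is a single sentence, matching the sentence-pair format the NLI model was trained on.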

---

## Alternative judge approaches considered

### Ollama (local LLM judge)
Ollama can run Llama 3 / Mistral locally, making it a zero-cost alternative to
HF Inference API for both generation and LLM-as-judge evaluation. Tradeoffs:
requires local GPU or accepts slower CPU inference; no external API rate limits;
outputs are fully reproducible since the model version is pinned. For the
faithfulness judge specifically, a local `llama3` via Ollama would remove the
dependency on HF token entirely and allow offline eval runs.

### Prometheus (LLM eval framework)
[Prometheus-2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) is a
7B model fine-tuned specifically for evaluation tasks: it outputs a score plus rationale
in a structured format designed for rubric-based grading. It's a drop-in replacement
for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
The tradeoff vs. the current sentence-level NLI approach: Prometheus is slower (7B vs
purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
which is more interpretable for audit and debugging.

**Why not used here:** the cross-encoder NLI approach runs faster and requires no prompt
engineering. Prometheus would be the right choice if rationale logging is a compliance requirement.

---

## What another 4 hours would add

- **`eval/metrics.py`, L2 LLM metrics**: contextual precision (chunk ranking),
  contextual recall (coverage), and answer correctness against full reference answers.
  Currently only keyphrase coverage is used as a proxy.
- **Async client**: `httpx.AsyncClient` variant for high-concurrency load testing.
- **Property-based tests**: `hypothesis` to fuzz `check_terminology` and graders
  with generated strings, catching edge cases the golden dataset doesn't cover.
- **CI pipeline**: GitHub Actions running `make lint`, `make type-check`,
  `make test` on every PR. Integration tests gated on a self-hosted runner with
  the API running.
- **Threshold calibration report**: `eval/calibrate.py` exists and runs graders
  against golden-dataset expected answers, so threshold calibration is now a single
  command, not a missing feature. Actual threshold adjustments require reviewing
  the output against real query distributions.

## Gate 5 audit gaps addressed

- **Faithfulness false negatives on refusals**: `_is_refusal()` detects "I don't have
  enough information" responses and returns score=1.0: with no factual claims, the response is trivially faithful.
- **Partial grounding blind spot**: faithfulness now uses claim-level decomposition
  (`grade_faithfulness_decomposed`). Response split into sentences; each verified
  independently. Score = supported_claims / total_claims. A response with one hallucinated
  sentence in three now scores 0.667, not 1.0.
- **No escalation path**: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING
  log entry and sets `flagged: true` in the response payload. UI shows a red banner.
- **Cold-start latency**: embedder and NLI model pre-warmed at startup in the FastAPI lifespan.
- **Happy-path-only golden dataset**: 4 adversarial pairs added (vague query, rival-term
  prompt injection, multi-doc synthesis, hallucination bait).
- **No drift detection**: added `eval/drift.py`, a KS two-sample test per metric that compares
  live telemetry scores against golden-dataset baseline. Detects faithfulness degradation
  at p < 0.05 with ~40% traffic degradation across 40+ events.
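
The refusal shortcut and the claim-level score from the list above can be sketched together. The names here are hypothetical and differ from the project's `_is_refusal()` and `grade_faithfulness_decomposed`; the per-claim check is passed in as a callable, whereas the project uses its NLI grader. Treating an empty response as score 1.0 is an assumption.

```python
import re
from typing import Callable

# Only the phrase quoted in these notes; any further markers would be assumptions.
REFUSAL_MARKERS = ("i don't have enough information",)


def is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def decomposed_faithfulness(response: str, is_supported: Callable[[str], bool]) -> float:
    """Score = supported_claims / total_claims, with one claim per sentence."""
    if is_refusal(response):
        return 1.0  # no factual claims, so nothing can be unfaithful
    claims = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not claims:
        return 1.0  # assumption: an empty response makes no claims either
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)
```

For a three-sentence response where the checker rejects one sentence, this yields 2/3, the 0.667 figure quoted above.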

---

## Where LLM assistance helped and where it misled

**Helped:**
- Scaffolding the full project structure (backend, client, tests, config) in a
  single session without losing consistency across files.
- Writing the faithfulness prompt in a way that reliably returns structured JSON;
  the few-shot JSON format in the prompt was a suggested pattern that works.
- Catching that `except Exception` in the faithfulness grader was too broad and
  replacing it with `(json.JSONDecodeError, anthropic.APIError)`.
- Identifying that `_build_index_by_domain` was defined twice in pipeline.py
  (duplicate introduced during an edit session), caught during code review.

**Misled or required correction:**
- Initially used `lru_cache` on a function that takes a `SentenceTransformer`
  instance as an argument: unhashable, so the cache silently failed. Required
  switching to a module-level dict cache.
- Generated a dead loop in `rosetta.py` (iterating over terms with `continue`
  but no code after the continue branch) that did nothing. The logic existed in
  a comment describing intent but was never implemented. Caught in review.
- Suggested a fictional client name that conflicted with a real company.
  Required renaming before the repo went public.
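
The module-level dict cache mentioned in the `lru_cache` item might look like the sketch below. It is illustrative only: the notes do not say what the project caches or how it builds keys, so `cached_encode`, the string `model_key`, and the `encode` callable are all hypothetical.

```python
from typing import Any, Callable

# Module-level cache keyed on plain strings, so the model object itself
# never has to serve as part of a cache key.
_ENCODE_CACHE: dict = {}


def cached_encode(model_key: str, text: str, encode: Callable[[str], Any]) -> Any:
    """Return the cached result for (model_key, text), computing it on a miss."""
    key = (model_key, text)
    if key not in _ENCODE_CACHE:
        _ENCODE_CACHE[key] = encode(text)
    return _ENCODE_CACHE[key]
```

Keying on a model name rather than the instance makes cache hits independent of object identity. The cost is that the caller must keep `model_key` consistent with the model actually passed in via `encode`.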