# 03 — Evaluation Plan

> ⚠️ **Methodology is current; some implementation pointers are historical.**
> The gold-set construction, grader signals, and metrics below still
> describe how we evaluate the bot. But the *system under test* is now the
> single-LLM-with-tools handler (`backend/single_brain.py` — one Gemini
> 2.5-flash call per turn with `save_profile_field` / `retrieve_policies` /
> `mark_recommendation`, structured+vector retrieval, small
> `nim_fallback.py` for transient errors). There is no `orchestrator`, no
> `sales_brain`/`qa_brain` split, no separate faithfulness-judge LLM, and no
> DuckDB hot path. Present-state authority: [`README.md`](../../README.md) §4.

| Field | Value |
| --- | --- |
| Project | Insurance Sales Portfolio Expert |
| Version | 0.1 |
| Date | 2026-05-13 |
| Depends on | `01-requirements.md` §6 (success criteria); README §4 (system under test) |
| Status | Pipeline implemented; full run pending corpus completion |

## 0. Purpose

Evaluation is **how we prove the bot works**. Three artifacts:

1. **Gold Q&A set** — ground-truth question-answer-source triples
2. **Automated grader** — measures factual accuracy + citation accuracy + refusal precision per turn
3. **Results table** — versioned, audit-grade record of every eval run

These directly back the success criteria in Doc 01 §6: C2 (≥95% factual), C3 (≥95% citation), C4 (≥90% refusal precision).

## 1. Gold Q&A construction — three pipelines

### Pipeline A — auto-generated from structured extraction (the bulk)

For every successfully extracted policy, generate templated Q&A where the answer comes directly from the 62-field structured extraction. Per `eval/generate_gold.py`:

- ~15 question templates × ~80 policies = ~1,100 candidate pairs
- Each pair is **fully reproducible** — the answer traces to a specific field which traces to a specific clause
- Auto-tagged: `question_type` ∈ {waiting_period, coverage_scope, exclusions, sub_limit, eligibility, claim, network, bonus}

**Why this scales:** adding the 11th, 12th, 80th policy adds zero new template work. Same generator, more rows.

### Pipeline B — LLM-drafted nuanced questions (curated)

For top-priority policies (top 10–20), run an LLM on the policy text with prompt: *"generate 5 buyer-style questions whose answers are explicitly in this document; include the source clause."* Each generated pair is **human spot-checked** before commit.

Target: 5 × 20 = 100 questions covering multi-clause and edge-case reasoning that Pipeline A can't generate.

### Pipeline C — hand-crafted adversarial set (refusal tests)

~30–40 hand-written questions across these classes:

- **Out-of-corpus** ("Does this cover space tourism?") → expect refusal
- **Out-of-policy-type** ("What is the IRDAI mandate on dental coverage?") → expect refusal (D-017)
- **Multi-policy compare** ("Compare cancer coverage in Star vs HDFC ERGO") → expect cited comparison
- **Hinglish** ("Cataract ke liye waiting period kya hai?") → expect Hindi answer, same factual accuracy
- **Code-switched** ("policy mein maternity cover hai kya?") → expect Hinglish answer

Each pair marked with `expected_refusal: bool`. Refusal cases test that the bot doesn't hallucinate when grounding fails.

## 2. Grader design

**Single grading endpoint:** `eval/run.py` calls the single-LLM-with-tools turn handler (`backend/single_brain.py`) in-process, takes the reply, scores against gold with three signals:

### Signal 1 — Regex hard-checks (deterministic)

Numbers, dates, currency, durations, percentages are extracted via regex from both gold and bot reply. Exact match (after normalization) is the strongest signal. Catches "the premium is ₹15,000" hallucinations without an LLM.

### Signal 2 — LLM-judge faithfulness (offline eval grader only)

The eval grader uses a **different model family** from the runtime Gemini 2.5-flash brain → **non-circular evaluation**. This judge is part of the *offline eval harness only* — it is not a runtime gate in the bot (runtime grounding is structural: the single LLM can only state what `retrieve_policies` returned and must cite it). Judge prompt:

> Given GOLD and BOT answers, output strict JSON: `{factual_match: bool, citation_present: bool, score: 0-1, reason: str}`. Be strict on partial matches.

### Signal 3 — Citation regex check

`[Source: ...]` tag must be present in BOT for non-refusal questions. Caught by `re.search(r'\[Source:'` pattern.

**Final per-question score:** `factual_match AND (citation_present OR expected_refusal)`.

## 3. Metrics computed

| Metric | Doc 01 target | How |
| --- | --- | --- |
| Factual accuracy | C2 ≥ 95% | n_correct / n_total |
| Citation accuracy | C3 ≥ 95% | n_correct_citations / n_non_refusal |
| Refusal precision | C4 ≥ 90% | n_correct_refusals / n_expected_refusals |
| Hindi parity | C8 within 5pp | factual_acc(hi) vs factual_acc(en) |
| Path winners | (config) | factual_acc grouped by which path produced the turn — `gemini` (primary single-LLM call) vs `nim_fallback` (transient-error fallback) |

## 4. Output artifacts per run

- `eval/results.md` — human-readable summary with per-type, per-path accuracy + sample misses
- `eval/results.json` — machine-readable full per-question record
- `logs/hallucinations.jsonl` — every blocked reply with its reason (audit log)

## 5. Run cadence

| Stage | Cadence | Implementation |
| --- | --- | --- |
| Development | Manual, after meaningful changes | `python -m eval.run` |
| Pre-deploy | Every PR | GitHub Actions runs eval, blocks merge if accuracy regresses |
| Production | Nightly synthetic + spot-grading | Scheduled job; live-traffic sampling via Playwright |
| Post-deploy verification | Per deploy | `tests/live_verify.py` runs full eval against the live HF Space URL |

## 6. Bilingual eval

Bilingual sub-set: 20 questions translated to Hindi + Hinglish via Sarvam-M with manual spot-check. Run separately. Hindi factual accuracy compared to English baseline — must be within 5pp per C8.

## 7. Known limitations (transparent)

- **Sample bias:** Pipeline A questions are templated, so accuracy on Pipeline A is upper-bound real-world performance. Pipeline B + C provide the harder signal.
- **Single judge model:** the offline eval grader uses one judge model (a different family from the runtime Gemini brain) per question. Risk of judge-specific bias. v2: 3-judge consensus.
- **No human evaluation:** Cost-prohibitive for 1,100 questions. We rely on the grader; manual spot-check 5%.
- **No latency budget enforcement in eval:** Latency is captured per record but doesn't gate the score. Doc 01 C1 (p50 ≤ 4s) is monitored separately.