
03 — Evaluation Plan

⚠️ Methodology is current; some implementation pointers are historical. The gold-set construction, grader signals, and metrics below still describe how we evaluate the bot. But the system under test is now the single-LLM-with-tools handler (backend/single_brain.py — one Gemini 2.5-flash call per turn with save_profile_field / retrieve_policies / mark_recommendation, structured+vector retrieval, small nim_fallback.py for transient errors). There is no orchestrator, no sales_brain/qa_brain split, no separate faithfulness-judge LLM, and no DuckDB hot path. Present-state authority: README.md §4.

Field	Value
Project	Insurance Sales Portfolio Expert
Version	0.1
Date	2026-05-13
Depends on	`01-requirements.md` §6 (success criteria); README §4 (system under test)
Status	Pipeline implemented; full run pending corpus completion

0. Purpose

Evaluation is how we prove the bot works. Three artifacts:

Gold Q&A set — ground-truth question-answer-source triples
Automated grader — measures factual accuracy + citation accuracy + refusal precision per turn
Results table — versioned, audit-grade record of every eval run

These directly back the success criteria in Doc 01 §6: C2 (≥95% factual), C3 (≥95% citation), C4 (≥90% refusal precision).

1. Gold Q&A construction — three pipelines

Pipeline A — auto-generated from structured extraction (the bulk)

For every successfully extracted policy, generate templated Q&A where the answer comes directly from the 62-field structured extraction. Per eval/generate_gold.py:

~15 question templates × ~80 policies ≈ 1,100 candidate pairs (both factors are approximate; the exact 15 × 80 cross-product would be 1,200)
Each pair is fully reproducible — the answer traces to a specific field which traces to a specific clause
Auto-tagged: question_type ∈ {waiting_period, coverage_scope, exclusions, sub_limit, eligibility, claim, network, bonus}

Why this scales: adding the 11th, 12th, 80th policy adds zero new template work. Same generator, more rows.
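
A minimal sketch of how such a template-driven generator could look. This is not the contents of eval/generate_gold.py: the `GoldPair` record, the `TEMPLATES` entries, the field names, and the policy dictionary layout are all hypothetical, chosen only to illustrate "same generator, more rows".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldPair:
    policy_id: str
    question: str
    answer: str
    source_field: str       # which structured field the answer came from
    question_type: str
    expected_refusal: bool = False


# Hypothetical templates: (question_type, structured field name, question text).
TEMPLATES = [
    ("waiting_period", "cataract_waiting_period",
     "What is the cataract waiting period under {name}?"),
    ("bonus", "no_claim_bonus",
     "What no-claim bonus does {name} offer?"),
]


def generate_pairs(policies):
    """Yield one GoldPair per (policy, template) whose field was extracted."""
    for policy in policies:
        for qtype, field, text in TEMPLATES:
            value = policy["fields"].get(field)
            if value is None:
                # Assumption: a field missing from the extraction is skipped,
                # never guessed, so every answer stays traceable.
                continue
            yield GoldPair(
                policy_id=policy["id"],
                question=text.format(name=policy["name"]),
                answer=str(value),
                source_field=field,
                question_type=qtype,
            )
```

Adding a policy means appending one more dictionary to the input; the template list is untouched.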

Pipeline B — LLM-drafted nuanced questions (curated)

For top-priority policies (top 10–20), run an LLM on the policy text with the prompt: "generate 5 buyer-style questions whose answers are explicitly in this document; include the source clause." Each generated pair is spot-checked by a human before commit.

Target: 5 × 20 = 100 questions covering multi-clause and edge-case reasoning that Pipeline A can't generate.

Pipeline C — hand-crafted adversarial set (refusal tests)

~30–40 hand-written questions across these classes:

Out-of-corpus ("Does this cover space tourism?") → expect refusal
Out-of-policy-type ("What is the IRDAI mandate on dental coverage?") → expect refusal (D-017)
Multi-policy compare ("Compare cancer coverage in Star vs HDFC ERGO") → expect cited comparison
Hinglish ("Cataract ke liye waiting period kya hai?") → expect Hindi answer, same factual accuracy
Code-switched ("policy mein maternity cover hai kya?") → expect Hinglish answer

Each pair is marked with expected_refusal: bool. Refusal cases test that the bot doesn't hallucinate when grounding fails.

2. Grader design

There is a single grading entry point: eval/run.py calls the single-LLM-with-tools turn handler (backend/single_brain.py) in-process, takes the reply, and scores it against the gold answer using three signals:

Signal 1 — Regex hard-checks (deterministic)

Numbers, dates, currency amounts, durations, and percentages are extracted via regex from both the gold answer and the bot reply. An exact match (after normalization) is the strongest signal. This catches "the premium is ₹15,000" hallucinations without an LLM.
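
The numeric part of this check can be sketched as below. The function names, the specific regex, and the rule that every gold number must appear in the reply are assumptions for illustration; the grader's actual normalization rules are not reproduced here, and dates and durations would need their own patterns.

```python
import re

# Digits with optional comma grouping (Western or Indian) and decimals,
# e.g. 15,000 / 1,50,000 / 2.5
_NUMBER = re.compile(r"\d+(?:,\d+)*(?:\.\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Return normalized numeric tokens: commas removed, trailing zeros dropped."""
    found = set()
    for token in _NUMBER.findall(text):
        value = token.replace(",", "")
        if "." in value:
            value = value.rstrip("0").rstrip(".")
        found.add(value)
    return found


def numbers_match(gold: str, reply: str) -> bool:
    """True when every number in the gold answer also appears in the reply."""
    return extract_numbers(gold) <= extract_numbers(reply)
```

For example, `numbers_match("The premium is ₹15,000", "Premium: Rs. 15000.00 per year")` holds, while a reply quoting a different amount fails the check.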

Signal 2 — LLM-judge faithfulness (offline eval grader only)

The eval grader uses a different model family from the runtime Gemini 2.5-flash brain → non-circular evaluation. This judge is part of the offline eval harness only — it is not a runtime gate in the bot (runtime grounding is structural: the single LLM can only state what retrieve_policies returned and must cite it). Judge prompt:

Given GOLD and BOT answers, output strict JSON: {factual_match: bool, citation_present: bool, score: 0-1, reason: str}. Be strict on partial matches.
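
Because the judge is asked for strict JSON, its output can be validated before it is trusted. The sketch below is one way to do that; `JudgeVerdict` and `parse_verdict` are hypothetical names, and rejecting malformed output with `ValueError` is an assumed policy rather than the documented behavior of the harness.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    factual_match: bool
    citation_present: bool
    score: float
    reason: str


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse the judge's JSON reply; raise ValueError on any schema violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for key in ("factual_match", "citation_present"):
        if not isinstance(data.get(key), bool):
            raise ValueError(f"{key} must be a boolean")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 1:
        raise ValueError("score must lie in [0, 1]")
    if not isinstance(data.get("reason"), str):
        raise ValueError("reason must be a string")
    return JudgeVerdict(
        factual_match=data["factual_match"],
        citation_present=data["citation_present"],
        score=float(score),
        reason=data["reason"],
    )
```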

Signal 3 — Citation regex check

A [Source: ...] tag must be present in BOT for non-refusal questions. It is detected with a re.search call on the pattern r'\[Source:'.

Final per-question score: factual_match AND (citation_present OR expected_refusal).
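
The citation check and the final pass/fail rule can be written directly from the formula above. The function names are illustrative; only the regex and the boolean expression come from this section.

```python
import re

_CITATION = re.compile(r"\[Source:")


def citation_present(reply: str) -> bool:
    """True when the reply contains a [Source: ...] tag."""
    return _CITATION.search(reply) is not None


def question_passes(factual_match: bool, reply: str, expected_refusal: bool) -> bool:
    """factual_match AND (citation_present OR expected_refusal)."""
    return factual_match and (citation_present(reply) or expected_refusal)
```

A factually correct answer with no citation therefore fails unless the question was one where a refusal is expected.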

3. Metrics computed

Metric	Doc 01 target	How
Factual accuracy	C2 ≥ 95%	n_correct / n_total
Citation accuracy	C3 ≥ 95%	n_correct_citations / n_non_refusal
Refusal precision	C4 ≥ 90%	n_correct_refusals / n_expected_refusals
Hindi parity	C8 within 5pp	factual_acc(hi) vs factual_acc(en)
Path winners	(config)	factual_acc grouped by which path produced the turn — `gemini` (primary single-LLM call) vs `nim_fallback` (transient-error fallback)
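
A sketch of how the ratios in this table could be computed from per-question records. The record keys (`factual_match`, `citation_present`, `expected_refusal`, `refused`, `path`) are assumed for illustration and are not the schema of eval/results.json. The refusal ratio follows the table as written; with expected refusals as the denominator it measures how many expected refusals were honored, which in standard terminology is closer to recall than to precision.

```python
from collections import defaultdict


def ratio(numerator: int, denominator: int):
    """Return numerator / denominator, or None when the group is empty."""
    return numerator / denominator if denominator else None


def compute_metrics(records: list[dict]) -> dict:
    non_refusal = [r for r in records if not r["expected_refusal"]]
    refusal = [r for r in records if r["expected_refusal"]]

    # Group factual outcomes by the path that produced the turn.
    by_path = defaultdict(list)
    for r in records:
        by_path[r["path"]].append(r["factual_match"])

    return {
        "factual_accuracy": ratio(sum(r["factual_match"] for r in records), len(records)),
        "citation_accuracy": ratio(sum(r["citation_present"] for r in non_refusal), len(non_refusal)),
        # n_correct_refusals / n_expected_refusals, as defined in the table
        "refusal_precision": ratio(sum(r["refused"] for r in refusal), len(refusal)),
        "factual_accuracy_by_path": {p: sum(v) / len(v) for p, v in by_path.items()},
    }
```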

4. Output artifacts per run

eval/results.md — human-readable summary with per-type, per-path accuracy + sample misses
eval/results.json — machine-readable full per-question record
logs/hallucinations.jsonl — every blocked reply with its reason (audit log)

5. Run cadence

Stage	Cadence	Implementation
Development	Manual, after meaningful changes	`python -m eval.run`
Pre-deploy	Every PR	GitHub Actions runs eval, blocks merge if accuracy regresses
Production	Nightly synthetic + spot-grading	Scheduled job; live-traffic sampling via Playwright
Post-deploy verification	Per deploy	`tests/live_verify.py` runs full eval against the live HF Space URL
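
The "blocks merge if accuracy regresses" rule in the pre-deploy row could be implemented as a comparison against a stored baseline. The sketch below assumes both runs are summarized as flat metric dictionaries and that a CI step fails when the returned list is non-empty; the function name, the tolerance parameter, and that wiring are hypothetical.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.0) -> list[str]:
    """List metrics that are missing or dropped more than `tolerance` below baseline."""
    regressions = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None:
            regressions.append(f"{metric}: missing from current run")
        elif value < base_value - tolerance:
            regressions.append(f"{metric}: {base_value:.3f} -> {value:.3f}")
    return regressions
```

A CI script would typically call this and exit with a non-zero status when the list is non-empty.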

6. Bilingual eval

Bilingual sub-set: 20 questions translated to Hindi + Hinglish via Sarvam-M with manual spot-check. Run separately. Hindi factual accuracy compared to English baseline — must be within 5pp per C8.
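
The C8 parity rule reduces to a gap check in percentage points. The sketch below assumes the gap is taken as an absolute difference, so Hindi scoring higher than English by more than 5pp would also be flagged; a one-sided check is an equally plausible reading of C8.

```python
def within_parity(acc_hi: float, acc_en: float, max_gap_pp: float = 5.0) -> bool:
    """True when Hindi and English accuracy differ by at most max_gap_pp points."""
    # Accuracies are fractions in [0, 1]; rounding absorbs float noise at the boundary.
    gap_pp = round(abs(acc_hi - acc_en) * 100, 6)
    return gap_pp <= max_gap_pp
```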

7. Known limitations (transparent)

Sample bias: Pipeline A questions are templated, so accuracy on Pipeline A is an upper bound on real-world performance. Pipelines B and C provide the harder signal.
Single judge model: the offline eval grader uses one judge model (a different family from the runtime Gemini brain) per question. Risk of judge-specific bias. v2: 3-judge consensus.
No full human evaluation: cost-prohibitive for ~1,100 questions. We rely on the grader, plus a manual spot-check of 5%.
No latency budget enforcement in eval: Latency is captured per record but doesn't gate the score. Doc 01 C1 (p50 ≤ 4s) is monitored separately.