03: Evaluation Plan
⚠️ Methodology is current; some implementation pointers are historical. The gold-set construction, grader signals, and metrics below still describe how we evaluate the bot. But the system under test is now the single-LLM-with-tools handler (backend/single_brain.py: one Gemini 2.5-flash call per turn with save_profile_field / retrieve_policies / mark_recommendation, structured + vector retrieval, and a small nim_fallback.py for transient errors). There is no orchestrator, no sales_brain/qa_brain split, no separate faithfulness-judge LLM, and no DuckDB hot path. Present-state authority: README.md §4.
| Field | Value |
|---|---|
| Project | Insurance Sales Portfolio Expert |
| Version | 0.1 |
| Date | 2026-05-13 |
| Depends on | 01-requirements.md §6 (success criteria); README §4 (system under test) |
| Status | Pipeline implemented; full run pending corpus completion |
0. Purpose
Evaluation is how we prove the bot works. Three artifacts:
- Gold Q&A set: ground-truth question-answer-source triples
- Automated grader: measures factual accuracy + citation accuracy + refusal precision per turn
- Results table: versioned, audit-grade record of every eval run
These directly back the success criteria in Doc 01 §6: C2 (≥95% factual), C3 (≥95% citation), C4 (≥90% refusal precision).
1. Gold Q&A construction: three pipelines
Pipeline A: auto-generated from structured extraction (the bulk)
For every successfully extracted policy, generate templated Q&A where the answer comes directly from the 62-field structured extraction. Per eval/generate_gold.py:
- ~15 question templates × ~80 policies = ~1,100 candidate pairs
- Each pair is fully reproducible: the answer traces to a specific field, which traces to a specific clause
- Auto-tagged: question_type ∈ {waiting_period, coverage_scope, exclusions, sub_limit, eligibility, claim, network, bonus}
Why this scales: adding the 11th, 12th, 80th policy adds zero new template work. Same generator, more rows.
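The templating idea behind Pipeline A can be sketched roughly as follows. This is a minimal illustration, not the contents of eval/generate_gold.py: the template strings, the field names (waiting_period, room_rent_sub_limit), the GoldPair class, and the policy dict shape are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical templates: question_type -> (question template, extraction field).
# The real template set and field names are not shown in this document.
TEMPLATES = {
    "waiting_period": ("What is the waiting period under {policy}?", "waiting_period"),
    "sub_limit": ("What is the room-rent sub-limit under {policy}?", "room_rent_sub_limit"),
}


@dataclass
class GoldPair:
    policy: str
    question_type: str
    question: str
    answer: str
    source_field: str  # the extraction field the answer was copied from


def generate_pairs(policies):
    """Return one GoldPair per (policy, template) whose field is populated."""
    pairs = []
    for policy in policies:
        for qtype, (template, field_name) in TEMPLATES.items():
            value = policy["fields"].get(field_name)
            if value is None:
                continue  # skip fields the extraction did not fill
            question = template.format(policy=policy["name"])
            pairs.append(GoldPair(policy["name"], qtype, question, str(value), field_name))
    return pairs
```

Adding a policy only adds an entry to the input list; the templates stay untouched, which is the scaling property described above.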
Pipeline B: LLM-drafted nuanced questions (curated)
For top-priority policies (top 10–20), run an LLM on the policy text with prompt: "generate 5 buyer-style questions whose answers are explicitly in this document; include the source clause." Each generated pair is human spot-checked before commit.
Target: 5 × 20 = 100 questions covering multi-clause and edge-case reasoning that Pipeline A can't generate.
Pipeline C: hand-crafted adversarial set (refusal tests)
~30–40 hand-written questions across these classes:
- Out-of-corpus ("Does this cover space tourism?") → expect refusal
- Out-of-policy-type ("What is the IRDAI mandate on dental coverage?") → expect refusal (D-017)
- Multi-policy compare ("Compare cancer coverage in Star vs HDFC ERGO") → expect cited comparison
- Hinglish ("Cataract ke liye waiting period kya hai?") → expect Hindi answer, same factual accuracy
- Code-switched ("policy mein maternity cover hai kya?") → expect Hinglish answer
Each pair is marked with expected_refusal: bool. Refusal cases test that the bot doesn't hallucinate when grounding fails.
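One way to store such hand-written pairs is as JSON Lines, validated on write. The sketch below is an assumption about storage, not the project's actual format: only the expected_refusal flag comes from this document, while the category key, its values, and the to_jsonl helper are hypothetical.

```python
import json

# Hypothetical records; the "category" key and its values are assumed names.
ADVERSARIAL = [
    {"question": "Does this cover space tourism?",
     "category": "out_of_corpus", "expected_refusal": True},
    {"question": "Compare cancer coverage in Star vs HDFC ERGO",
     "category": "multi_policy_compare", "expected_refusal": False},
]


def to_jsonl(records):
    """Serialize records one JSON object per line, rejecting malformed ones."""
    lines = []
    for record in records:
        if not isinstance(record.get("expected_refusal"), bool):
            raise ValueError(f"expected_refusal must be a bool: {record!r}")
        # ensure_ascii=False keeps Hindi/Hinglish questions readable in the file
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```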
2. Grader design
Single grading endpoint: eval/run.py calls the single-LLM-with-tools turn handler (backend/single_brain.py) in-process, takes the reply, and scores it against gold with three signals:
Signal 1: Regex hard-checks (deterministic)
Numbers, dates, currency, durations, percentages are extracted via regex from both gold and bot reply. Exact match (after normalization) is the strongest signal. Catches "the premium is ₹15,000" hallucinations without an LLM.
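A minimal sketch of the numeric part of this check is shown below. It handles only plain numbers (with thousands separators and decimals), and it treats "match" as "every number in the gold answer also appears in the reply", which is one possible reading of the rule rather than the grader's confirmed behaviour. The function names are hypothetical.

```python
import re

# Integers or decimals with optional thousands separators, e.g. 15,000 or 2.5
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values in text, normalized to floats."""
    # Stripping commas normalizes both 15,000 and 1,50,000 style grouping.
    return {float(match.replace(",", "")) for match in NUMBER_RE.findall(text)}


def regex_hard_check(gold, reply):
    """True when every number in the gold answer also appears in the reply."""
    return extract_numbers(gold) <= extract_numbers(reply)
```

For example, a gold answer of "₹15,000" fails against a reply stating "₹18,000", because 15000.0 is missing from the reply's number set.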
Signal 2: LLM-judge faithfulness (offline eval grader only)
The eval grader uses a different model family from the runtime Gemini 2.5-flash brain, which keeps the evaluation non-circular. This judge is part of the offline eval harness only; it is not a runtime gate in the bot (runtime grounding is structural: the single LLM can only state what retrieve_policies returned and must cite it). Judge prompt:
Given GOLD and BOT answers, output strict JSON:
{factual_match: bool, citation_present: bool, score: 0-1, reason: str}. Be strict on partial matches.
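Because the judge is asked for strict JSON, the harness needs to reject replies that drift from that shape. The sketch below shows one way to validate the four keys named in the prompt; the parse_judge_output name and the choice to return None on any violation are assumptions, not the harness's documented behaviour.

```python
import json

# Expected keys and types, taken from the judge prompt above.
REQUIRED = {
    "factual_match": bool,
    "citation_present": bool,
    "score": (int, float),
    "reason": str,
}


def parse_judge_output(raw):
    """Parse the judge's reply; return None if it is not the strict JSON shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in REQUIRED.items():
        if key not in data or not isinstance(data[key], expected_type):
            return None
    # bool is a subclass of int, so reject a boolean score explicitly.
    if isinstance(data["score"], bool) or not 0 <= data["score"] <= 1:
        return None
    return data
```

Treating a malformed judge reply as "no verdict" rather than guessing keeps a formatting failure from silently counting as a pass.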
Signal 3: Citation regex check
A [Source: ...] tag must be present in BOT for non-refusal questions. Detected by applying re.search with the pattern r'\[Source:' to the bot reply.
Final per-question score: factual_match AND (citation_present OR expected_refusal).
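The citation check and the final per-question rule can be written out directly. In this sketch the function names are hypothetical, and factual_match is assumed to come from the earlier signals.

```python
import re

CITATION_RE = re.compile(r"\[Source:")


def citation_present(reply):
    """True when the reply contains a [Source: ...] tag."""
    return CITATION_RE.search(reply) is not None


def question_passes(factual_match, reply, expected_refusal):
    """factual_match AND (citation_present OR expected_refusal)."""
    return factual_match and (citation_present(reply) or expected_refusal)
```

Under this rule a correct refusal passes without a citation, while a factually correct answer to a non-refusal question still fails if the source tag is missing.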
3. Metrics computed
| Metric | Doc 01 target | How |
|---|---|---|
| Factual accuracy | C2 ≥ 95% | n_correct / n_total |
| Citation accuracy | C3 ≥ 95% | n_correct_citations / n_non_refusal |
| Refusal precision | C4 ≥ 90% | n_correct_refusals / n_expected_refusals |
| Hindi parity | C8 within 5pp | factual_acc(hi) vs factual_acc(en) |
| Path winners | (config) | factual_acc grouped by the path that produced the turn: gemini (primary single-LLM call) vs nim_fallback (transient-error fallback) |
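The first three rows of the table reduce to simple ratios over per-question records. The sketch below follows the formulas exactly as written in the table; the record keys (factual_match, citation_present, refused, expected_refusal) and the compute_metrics name are assumptions about the record shape.

```python
def compute_metrics(records):
    """Compute the table's ratios from a list of per-question dicts."""
    non_refusal = [r for r in records if not r["expected_refusal"]]
    expected_refusals = [r for r in records if r["expected_refusal"]]

    def ratio(hits, total):
        # None signals "no questions of this kind" instead of dividing by zero
        return hits / total if total else None

    return {
        "factual_accuracy": ratio(
            sum(r["factual_match"] for r in records), len(records)),
        "citation_accuracy": ratio(
            sum(r["citation_present"] for r in non_refusal), len(non_refusal)),
        "refusal_precision": ratio(
            sum(r["refused"] for r in expected_refusals), len(expected_refusals)),
    }
```

The per-path breakdown would apply the same factual_accuracy ratio after grouping records by the path that produced them.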
4. Output artifacts per run
- eval/results.md: human-readable summary with per-type, per-path accuracy + sample misses
- eval/results.json: machine-readable full per-question record
- logs/hallucinations.jsonl: every blocked reply with its reason (audit log)
5. Run cadence
| Stage | Cadence | Implementation |
|---|---|---|
| Development | Manual, after meaningful changes | python -m eval.run |
| Pre-deploy | Every PR | GitHub Actions runs eval, blocks merge if accuracy regresses |
| Production | Nightly synthetic + spot-grading | Scheduled job; live-traffic sampling via Playwright |
| Post-deploy verification | Per deploy | tests/live_verify.py runs full eval against the live HF Space URL |
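The pre-deploy gate ("blocks merge if accuracy regresses") amounts to comparing the current run's metrics against a stored baseline. A minimal sketch of that comparison follows; the find_regressions name, the tolerance parameter, and the idea of a baseline dict are assumptions, since the actual GitHub Actions workflow is not shown here.

```python
def find_regressions(current, baseline, tolerance=0.0):
    """Return the sorted names of metrics that fell below baseline - tolerance."""
    return sorted(
        name
        for name, base_value in baseline.items()
        # a metric missing from the current run counts as a regression
        if current.get(name, 0.0) < base_value - tolerance
    )
```

A CI step could call this with the previous and current metric dicts and exit non-zero when the returned list is non-empty, which is what would block the merge.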
6. Bilingual eval
Bilingual sub-set: 20 questions translated to Hindi + Hinglish via Sarvam-M with manual spot-check. Run separately. Hindi factual accuracy is compared to the English baseline and must be within 5pp per C8.
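The C8 parity check is a comparison of two accuracies in percentage points. The sketch below reads "within 5pp" as an absolute gap in either direction, which is an assumption; the within_parity name is hypothetical.

```python
def within_parity(acc_en, acc_hi, max_gap_pp=5.0):
    """True when the accuracies (fractions in [0, 1]) differ by at most max_gap_pp."""
    # Round to avoid float noise pushing an exact 5pp gap over the limit.
    gap_pp = round(abs(acc_en - acc_hi) * 100, 6)
    return gap_pp <= max_gap_pp
```

With only 20 questions in the bilingual sub-set, each question moves accuracy by 5pp, so a single extra miss in Hindi already sits at the edge of the threshold.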
7. Known limitations (transparent)
- Sample bias: Pipeline A questions are templated, so accuracy on Pipeline A is an upper bound on real-world performance. Pipelines B and C provide the harder signal.
- Single judge model: the offline eval grader uses one judge model (a different family from the runtime Gemini brain) per question. Risk of judge-specific bias. v2: 3-judge consensus.
- No human evaluation: Cost-prohibitive for 1,100 questions. We rely on the grader, plus a manual spot-check of 5%.
- No latency budget enforcement in eval: Latency is captured per record but doesn't gate the score. Doc 01 C1 (p50 ≤ 4s) is monitored separately.