# 03: Evaluation Plan

> ⚠️ **Methodology is current; some implementation pointers are historical.**
> The gold-set construction, grader signals, and metrics below still
> describe how we evaluate the bot. But the *system under test* is now the
> single-LLM-with-tools handler (`backend/single_brain.py`: one Gemini
> 2.5-flash call per turn with `save_profile_field` / `retrieve_policies` /
> `mark_recommendation`, structured+vector retrieval, small
> `nim_fallback.py` for transient errors). There is no `orchestrator`, no
> `sales_brain`/`qa_brain` split, no separate faithfulness-judge LLM, and no
> DuckDB hot path. Present-state authority: [`README.md`](../../README.md) §4.

| Field | Value |
| --- | --- |
| Project | Insurance Sales Portfolio Expert |
| Version | 0.1 |
| Date | 2026-05-13 |
| Depends on | `01-requirements.md` §6 (success criteria); README §4 (system under test) |
| Status | Pipeline implemented; full run pending corpus completion |

## 0. Purpose

Evaluation is **how we prove the bot works**. Three artifacts:

1. **Gold Q&A set**: ground-truth question-answer-source triples
2. **Automated grader**: measures factual accuracy + citation accuracy + refusal precision per turn
3. **Results table**: versioned, audit-grade record of every eval run

These directly back the success criteria in Doc 01 §6: C2 (≥95% factual), C3 (≥95% citation), C4 (≥90% refusal precision).

## 1. Gold Q&A construction: three pipelines

### Pipeline A: auto-generated from structured extraction (the bulk)

For every successfully extracted policy, generate templated Q&A where the answer comes directly from the 62-field structured extraction. Per `eval/generate_gold.py`:

- ~15 question templates × ~80 policies ≈ 1,100 candidate pairs (both factors are approximate)
- Each pair is **fully reproducible**: the answer traces to a specific field, which traces to a specific clause
- Auto-tagged: `question_type` ∈ {waiting_period, coverage_scope, exclusions, sub_limit, eligibility, claim, network, bonus}

**Why this scales:** adding the 11th, 12th, or 80th policy adds zero new template work. Same generator, more rows.
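
The template-times-policy expansion described above can be sketched as follows. This is a minimal illustration, not the contents of `eval/generate_gold.py`: the template list, the field names (`cataract_waiting_period_months`, `room_rent_limit`), the `GoldPair` type, and the policy dict layout are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical templates: (question_type, extracted field name, question wording).
TEMPLATES = [
    ("waiting_period", "cataract_waiting_period_months",
     "What is the waiting period for cataract under {policy}?"),
    ("sub_limit", "room_rent_limit",
     "What is the room rent limit under {policy}?"),
]


@dataclass(frozen=True)
class GoldPair:
    policy_id: str
    question_type: str
    question: str
    answer: str
    source_field: str  # the extracted field the answer traces back to


def generate_gold(policies: list[dict]) -> list[GoldPair]:
    """Cross every template with every policy, skipping fields that were not extracted."""
    pairs = []
    for policy in policies:
        for qtype, field_name, wording in TEMPLATES:
            value = policy["fields"].get(field_name)
            if value is None:
                continue  # no extracted value means no reproducible answer
            pairs.append(GoldPair(
                policy_id=policy["id"],
                question_type=qtype,
                question=wording.format(policy=policy["name"]),
                answer=str(value),
                source_field=field_name,
            ))
    return pairs
```

Adding a policy only adds rows to the input list; the templates stay untouched, which is the scaling property claimed above.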

### Pipeline B: LLM-drafted nuanced questions (curated)

For top-priority policies (top 10–20), run an LLM on the policy text with the prompt: *"generate 5 buyer-style questions whose answers are explicitly in this document; include the source clause."* Each generated pair is **human spot-checked** before commit.

Target: 5 × 20 = 100 questions covering multi-clause and edge-case reasoning that Pipeline A can't generate.

### Pipeline C: hand-crafted adversarial set (refusal tests)

~30–40 hand-written questions across these classes:

- **Out-of-corpus** ("Does this cover space tourism?") → expect refusal
- **Out-of-policy-type** ("What is the IRDAI mandate on dental coverage?") → expect refusal (D-017)
- **Multi-policy compare** ("Compare cancer coverage in Star vs HDFC ERGO") → expect cited comparison
- **Hinglish** ("Cataract ke liye waiting period kya hai?", i.e. "What is the waiting period for cataract?") → expect Hindi answer, same factual accuracy
- **Code-switched** ("policy mein maternity cover hai kya?", i.e. "Does the policy include maternity cover?") → expect Hinglish answer

Each pair is marked with `expected_refusal: bool`. Refusal cases test that the bot doesn't hallucinate when grounding fails.

## 2. Grader design

**Single grading endpoint:** `eval/run.py` calls the single-LLM-with-tools turn handler (`backend/single_brain.py`) in-process, takes the reply, and scores it against gold with three signals:

### Signal 1: Regex hard-checks (deterministic)

Numbers, dates, currency amounts, durations, and percentages are extracted via regex from both the gold answer and the bot reply. An exact match (after normalization) is the strongest signal. This catches "the premium is ₹15,000" hallucinations without an LLM.
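
As a rough sketch of what such a hard-check could look like, the snippet below extracts numeric facts with an optional unit, normalizes them, and requires every gold fact to appear in the reply. The pattern, the unit list, and the function names are assumptions for illustration; the real grader may normalize differently (for instance, this sketch ignores currency symbols, does not parse dates, and does not convert "2 years" to "24 months").

```python
import re

# A number with optional thousands separators and decimals, then an optional
# unit. Longer alternatives come first so "months" is tried before "month".
_FACT = re.compile(
    r"(?P<num>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>%|years|year|months|month|days|day)?",
    re.IGNORECASE,
)


def extract_facts(text: str) -> set[tuple[float, str]]:
    """Return normalized (value, unit) pairs found in the text."""
    facts = set()
    for m in _FACT.finditer(text):
        value = float(m.group("num").replace(",", ""))  # "15,000" -> 15000.0
        unit = (m.group("unit") or "").lower().rstrip("s")  # "Months" -> "month"
        facts.add((value, unit))
    return facts


def hard_check(gold: str, reply: str) -> bool:
    """True when every numeric fact in the gold answer also appears in the reply.

    A gold answer with no numeric facts passes vacuously, so this signal only
    bites on questions whose answers contain numbers.
    """
    return extract_facts(gold) <= extract_facts(reply)
```

The subset test lets the reply carry extra numbers (such as a clause number inside a citation) without failing, while a wrong premium or waiting period is still caught.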

### Signal 2: LLM-judge faithfulness (offline eval grader only)

The eval grader uses a **different model family** from the runtime Gemini 2.5-flash brain, which keeps the evaluation **non-circular**. This judge is part of the *offline eval harness only*; it is not a runtime gate in the bot (runtime grounding is structural: the single LLM can only state what `retrieve_policies` returned and must cite it). Judge prompt:

> Given GOLD and BOT answers, output strict JSON: `{factual_match: bool, citation_present: bool, score: 0-1, reason: str}`. Be strict on partial matches.
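
Because the judge is asked for strict JSON, the harness needs to decide what happens when the output is malformed. One option, sketched here under the assumption that a malformed verdict should count as a failed match, is to validate the four fields and fail closed. The `Verdict` type and `parse_verdict` function are hypothetical names, not part of the existing code.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    factual_match: bool
    citation_present: bool
    score: float
    reason: str


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON output; anything malformed counts as a failed match."""
    failed = Verdict(False, False, 0.0, "malformed judge output")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return failed
    if not isinstance(data, dict):
        return failed

    factual = data.get("factual_match")
    cited = data.get("citation_present")
    score = data.get("score")
    reason = data.get("reason")

    if not (isinstance(factual, bool) and isinstance(cited, bool)):
        return failed
    # bool is a subclass of int, so reject it explicitly before the range check.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return failed
    if not 0 <= score <= 1 or not isinstance(reason, str):
        return failed
    return Verdict(factual, cited, float(score), reason)
```

Failing closed keeps a chatty or truncated judge response from silently inflating accuracy.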

### Signal 3: Citation regex check

A `[Source: ...]` tag must be present in BOT for non-refusal questions. It is detected by running `re.search` with the pattern `r'\[Source:'` on the reply.

**Final per-question score:** `factual_match AND (citation_present OR expected_refusal)`.
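
The citation check and the final score formula combine into a few lines. This is a sketch with hypothetical function names; only the regex and the boolean formula come from the text above.

```python
import re

CITATION = re.compile(r"\[Source:")


def citation_present(reply: str) -> bool:
    """True when the reply contains a [Source: ...] tag."""
    return CITATION.search(reply) is not None


def question_passes(factual_match: bool, reply: str, expected_refusal: bool) -> bool:
    """factual_match AND (citation_present OR expected_refusal)."""
    return factual_match and (citation_present(reply) or expected_refusal)
```

A correct refusal passes without a citation, while a factually correct answer to a non-refusal question still fails if the `[Source: ...]` tag is missing.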

## 3. Metrics computed

| Metric | Doc 01 target | How |
| --- | --- | --- |
| Factual accuracy | C2 ≥ 95% | n_correct / n_total |
| Citation accuracy | C3 ≥ 95% | n_correct_citations / n_non_refusal |
| Refusal precision | C4 ≥ 90% | n_correct_refusals / n_expected_refusals |
| Hindi parity | C8 within 5pp | factual_acc(hi) vs factual_acc(en) |
| Path winners | (config) | factual_acc grouped by which path produced the turn: `gemini` (primary single-LLM call) vs `nim_fallback` (transient-error fallback) |
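
To make the table concrete, here is one way the metrics could be aggregated from per-question records. The record keys (`factual_match`, `citation_present`, `expected_refusal`, `refused`, `lang`, `path`) are assumed for this sketch and may not match the actual `eval/results.json` schema; the formulas follow the "How" column as written.

```python
from collections import defaultdict
from typing import Optional


def ratio(hits: int, total: int) -> Optional[float]:
    """Return hits / total, or None when there is nothing to divide by."""
    return hits / total if total else None


def grouped_accuracy(records: list[dict], key: str) -> dict:
    """Factual accuracy per distinct value of `key` (e.g. path or language)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["factual_match"])
    return {name: sum(flags) / len(flags) for name, flags in groups.items()}


def compute_metrics(records: list[dict]) -> dict:
    non_refusal = [r for r in records if not r["expected_refusal"]]
    expected_refusals = [r for r in records if r["expected_refusal"]]
    by_lang = grouped_accuracy(records, "lang")

    # Parity gap in percentage points; None if either language is absent.
    gap_pp = None
    if "en" in by_lang and "hi" in by_lang:
        gap_pp = abs(by_lang["en"] - by_lang["hi"]) * 100

    return {
        "factual_accuracy": ratio(sum(r["factual_match"] for r in records), len(records)),
        "citation_accuracy": ratio(
            sum(r["citation_present"] for r in non_refusal), len(non_refusal)),
        "refusal_precision": ratio(
            sum(r["refused"] for r in expected_refusals), len(expected_refusals)),
        "hindi_parity_gap_pp": gap_pp,
        "factual_accuracy_by_path": grouped_accuracy(records, "path"),
    }
```

Returning `None` for an empty denominator avoids reporting a misleading 0% or 100% when a run contains, for example, no refusal questions.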

## 4. Output artifacts per run

- `eval/results.md`: human-readable summary with per-type and per-path accuracy, plus sample misses
- `eval/results.json`: machine-readable full per-question record
- `logs/hallucinations.jsonl`: every blocked reply with its reason (audit log)

## 5. Run cadence

| Stage | Cadence | Implementation |
| --- | --- | --- |
| Development | Manual, after meaningful changes | `python -m eval.run` |
| Pre-deploy | Every PR | GitHub Actions runs eval, blocks merge if accuracy regresses |
| Production | Nightly synthetic + spot-grading | Scheduled job; live-traffic sampling via Playwright |
| Post-deploy verification | Per deploy | `tests/live_verify.py` runs full eval against the live HF Space URL |
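
The pre-deploy gate ("blocks merge if accuracy regresses") implies comparing a run against some stored baseline. A minimal sketch of that comparison follows, assuming each file is a flat JSON object mapping metric names to numbers; the file layout, the function names, and the tolerance parameter are hypothetical and not taken from the existing CI setup.

```python
import json
from pathlib import Path


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.0) -> list[str]:
    """List every baseline metric that is missing or has dropped by more than `tolerance`."""
    regressions = []
    for name, old in baseline.items():
        new = current.get(name)
        if new is None:
            regressions.append(f"{name}: missing from current run")
        elif new < old - tolerance:
            regressions.append(f"{name}: {old:.3f} -> {new:.3f}")
    return regressions


def gate(baseline_path: str, current_path: str, tolerance: float = 0.0) -> int:
    """Return a process exit code: 0 when nothing regressed, 1 otherwise."""
    baseline = json.loads(Path(baseline_path).read_text(encoding="utf-8"))
    current = json.loads(Path(current_path).read_text(encoding="utf-8"))
    problems = find_regressions(baseline, current, tolerance)
    for line in problems:
        print(f"REGRESSION {line}")
    return 1 if problems else 0
```

A CI step could pass the return value of `gate(...)` to `sys.exit`, so a non-zero code fails the job and blocks the merge.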

## 6. Bilingual eval

Bilingual subset: 20 questions translated to Hindi and Hinglish via Sarvam-M, with a manual spot-check. It is run separately. Hindi factual accuracy is compared to the English baseline and must be within 5pp per C8.

## 7. Known limitations (transparent)

- **Sample bias:** Pipeline A questions are templated, so accuracy on Pipeline A is an upper bound on real-world performance. Pipelines B and C provide the harder signal.
- **Single judge model:** the offline eval grader uses one judge model (a different family from the runtime Gemini brain) per question, which risks judge-specific bias. v2: 3-judge consensus.
- **No human evaluation:** Cost-prohibitive for 1,100 questions. We rely on the grader and manually spot-check 5%.
- **No latency budget enforcement in eval:** Latency is captured per record but doesn't gate the score. Doc 01 C1 (p50 ≤ 4s) is monitored separately.