`eval/` — Gold-QA harness

Numerical accuracy + grounding eval. Walks a fixed list of curated questions through the orchestrator in-process (fast, no HTTP) and grades each reply twice: once with a regex hard-facts checker and once with an LLM judge from a different model family than the brain.

eval/ is the objective quality bar; tools/audit/ is the behavioural stress test. Both feed the readiness register in 80-audit/ENTERPRISE_AUDIT.md.

Files

| File | Role |
| --- | --- |
| `generate_gold.py` | Builds `gold_qa.json` from `40-data/policy_facts/`: for each curated field with a verbatim quote, emits a natural-language question + expected answer + expected citation. |
| `gold_qa.json` | 96-Q gold set. Each entry: `{policy_id, question, expected_answer, expected_regex, source_quote, source_pdf}`. |
| `run.py` | Runner. For each pair: calls `backend.orchestrator.handle_turn` with `policy_filter_ids=[pair.policy_id]` so retrieval is scoped to the policy under test. Grades each reply twice: regex hard-facts + `JUDGE_CHAIN` (NIM Mistral Large 3 675B primary; different family from the Gemini brain → non-circular). Falls through to NIM Llama-4 Maverick → OpenRouter Qwen 80B `:free` → NIM Nemotron 49B at the tail. Emits `brain_used` like `sales_brain::continue` / `sales_brain::complete` / `brain_main::<elected_model>` / `brain_main::graceful_error`. |
| `results.json` | Machine-readable last-run results. |
| `results.md` | Human-readable last-run report: per-question pass/fail, per-category rollup, hallucination breakdown. |
| `info_source_map.json` | Generated by `tools/info_source_map.py`. Claim → URL → verdict (✅ 798 / ⚠️ 321 / ❌ 0 / ⏳ 1385 as of 2026-05-14). The canonical source-grounding KPI. |
| `verified_urls.json` | HEAD-check verdict on every URL in the corpus / facts. Generated by `tools/verify_urls.py`. |
| `reviews_url_verification.json` | URL-validation output for `40-data/reviews/<insurer>.json`. |
| `chunk_sweep_results.json`, `chunk_diagnostic.json` | Outputs of `tools/chunk_sweep.py`: chunk-size / overlap grid. See ADR-018. |
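To make the `gold_qa.json` entry shape concrete, here is a minimal sketch of a loader that validates the six documented keys. It assumes the file's top level is a JSON array (not confirmed here); `GoldPair`, `parse_gold`, and the sample entry are hypothetical names and made-up values for illustration, not part of the harness.

```python
import json
import re
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GoldPair:
    """One gold-QA entry, mirroring the keys documented for gold_qa.json."""
    policy_id: str
    question: str
    expected_answer: str
    expected_regex: str
    source_quote: str
    source_pdf: str


def parse_gold(raw: str) -> list[GoldPair]:
    """Parse a JSON array of entries; fail loudly on missing keys or bad patterns."""
    required = {f.name for f in fields(GoldPair)}
    pairs = []
    for i, entry in enumerate(json.loads(raw)):
        missing = required - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        re.compile(entry["expected_regex"])  # raises re.error on an invalid pattern
        pairs.append(GoldPair(**{key: entry[key] for key in required}))
    return pairs


# Made-up sample entry for illustration only; not taken from the real gold set.
SAMPLE = json.dumps([{
    "policy_id": "example-insurer__example-plan",
    "question": "What is the initial waiting period?",
    "expected_answer": "30 days",
    "expected_regex": r"\b30\s+days\b",
    "source_quote": "(placeholder quote)",
    "source_pdf": "(placeholder).pdf",
}])

pairs = parse_gold(SAMPLE)
print(pairs[0].policy_id, pairs[0].expected_regex)
```

Validating the regex at load time surfaces a malformed pattern before any model call is made, which is cheaper than discovering it mid-run.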

Usage

# Full 96-Q eval
python -m eval.run

# Smoke
python -m eval.run --limit 30

# Scoped to one policy
python -m eval.run --policy hdfc-ergo__optima-secure

# Regenerate the gold set after curation refresh
python -m eval.generate_gold
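How `run.py` actually parses these flags is not shown in this README. The following is a minimal sketch of one way `--limit` and `--policy` could narrow the gold set, assuming `--limit` keeps the first N entries after policy filtering; `build_parser` and `select_pairs` are hypothetical helpers.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the two flags shown in the usage lines."""
    parser = argparse.ArgumentParser(prog="eval.run")
    parser.add_argument("--limit", type=int, default=None,
                        help="evaluate at most this many questions")
    parser.add_argument("--policy", default=None,
                        help="evaluate only questions for this policy_id")
    return parser


def select_pairs(pairs, limit=None, policy=None):
    """Filter by policy first, then truncate (ordering is an assumption)."""
    if policy is not None:
        pairs = [p for p in pairs if p["policy_id"] == policy]
    if limit is not None:
        pairs = pairs[:limit]
    return pairs


# Placeholder entries; real entries carry the full gold_qa.json schema.
demo_pairs = [
    {"policy_id": "a", "question": "q1"},
    {"policy_id": "b", "question": "q2"},
    {"policy_id": "a", "question": "q3"},
]

args = build_parser().parse_args(["--policy", "a", "--limit", "1"])
selected = select_pairs(demo_pairs, limit=args.limit, policy=args.policy)
print(selected)
```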

Grading

| Layer | What it checks | Output field |
| --- | --- | --- |
| Regex hard-facts | Expected numeric / boolean / waiting-period value appears verbatim in `reply_text`. | `regex_pass` |
| LLM judge (`JUDGE_CHAIN`: NIM Mistral Large 3 675B primary; NIM Llama-4 Maverick / OR Qwen 80B `:free` / NIM Nemotron 49B fallbacks per ADR-040) | Reply semantically answers the question and cites the expected source. Different family from the Gemini brain → non-circular. | `judge_pass` |
| Faithfulness gate | The 4-gate guard in `backend/faithfulness.py` already ran during `handle_turn`; eval records whether it blocked. | `blocked` |
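The three layers can be sketched as one grading function that fills the three output fields. This is an illustration under stated assumptions, not the code in `run.py`: the judge signature, the "try the next judge on any exception" fall-through rule, and the stub judges are all hypothetical, and no real model is called.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical judge signature: (question, expected_answer, reply_text) -> verdict.
Judge = Callable[[str, str, str], bool]


def regex_pass(expected_regex: str, reply_text: str) -> bool:
    """Hard-facts layer: the expected value must appear in the reply."""
    return re.search(expected_regex, reply_text) is not None


def judge_pass(judges: Sequence[Judge], question: str,
               expected_answer: str, reply_text: str) -> Optional[bool]:
    """Try each judge in order; fall through to the next one on failure."""
    for judge in judges:
        try:
            return judge(question, expected_answer, reply_text)
        except Exception:
            continue  # assumed fall-through rule: any error moves to the next judge
    return None  # every judge in the chain failed


def grade(pair: dict, reply_text: str, blocked: bool,
          judges: Sequence[Judge]) -> dict:
    """Combine the three layers into the documented output fields."""
    return {
        "regex_pass": regex_pass(pair["expected_regex"], reply_text),
        "judge_pass": judge_pass(judges, pair["question"],
                                 pair["expected_answer"], reply_text),
        "blocked": blocked,
    }


# Stub judges standing in for the real chain.
def flaky_primary(question, expected_answer, reply_text):
    raise TimeoutError("simulated outage")


def stub_fallback(question, expected_answer, reply_text):
    return expected_answer in reply_text


# Made-up pair and reply for illustration only.
pair = {
    "question": "What is the initial waiting period?",
    "expected_answer": "30 days",
    "expected_regex": r"\b30\s+days\b",
}
result = grade(pair, "The initial waiting period is 30 days.",
               blocked=False, judges=[flaky_primary, stub_fallback])
print(result)
```

Returning `None` when the whole chain fails keeps "no verdict" distinct from "judge said no", so an outage is not counted as a wrong answer.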

Acceptance bar

Per 80-audit/ENTERPRISE_AUDIT.md D-003: ≥ 90% on the 96-Q gold-QA set is the P0 gate for enterprise deployment. Current state lives in results.md.
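As a worked example of the bar: 90% of 96 is 86.4, so at least 87 questions must pass. A minimal sketch of that check follows, using integer arithmetic to avoid floating-point rounding. What counts as a per-question "pass" (regex, judge, or both) is not restated here, so the functions take plain counts; `min_passes` and `meets_bar` are hypothetical helpers.

```python
def min_passes(total: int, threshold_pct: int = 90) -> int:
    """Smallest pass count that reaches threshold_pct percent of total."""
    # Ceiling division on integers: ceil(total * threshold_pct / 100).
    return -(-total * threshold_pct // 100)


def meets_bar(passed: int, total: int, threshold_pct: int = 90) -> bool:
    """True when passed / total >= threshold_pct / 100, compared without floats."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed * 100 >= threshold_pct * total


print(min_passes(96))      # 87
print(meets_bar(86, 96))   # False
print(meets_bar(87, 96))   # True
```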

Related

backend/faithfulness.py: the 4-gate guard that produces the blocked verdict
kb/AUDIT_TRAIL.md § "Live bot replies": provenance trail for any flagged reply
tools/info_source_map.py: generator of info_source_map.json
ADR-014 (superseded, but the cross-family judge principle survives; current judge is NIM Mistral 675B, brain is Gemini)
ADR-040: current chain shape (Google → NIM → OpenRouter)
