# `eval/`: Gold-QA harness

Evaluates numerical accuracy and grounding. The harness walks a fixed list of curated questions through the orchestrator in-process (fast, no HTTP) and grades each reply twice: once with a regex hard-facts checker and once with an LLM judge from a different model family than the brain.

`eval/` is the objective quality bar; `tools/audit/` is the behavioural stress test. Both feed the readiness register in `80-audit/ENTERPRISE_AUDIT.md`.

## Files

| File | Role |
| --- | --- |
| `generate_gold.py` | Builds `gold_qa.json` from `40-data/policy_facts/`: for each curated field with a verbatim quote, emits a natural-language question, an expected answer, and an expected citation. |
| `gold_qa.json` | The 96-question gold set. Each entry: `{policy_id, question, expected_answer, expected_regex, source_quote, source_pdf}`. |
| `run.py` | Runner. For each pair, calls `backend.orchestrator.handle_turn` with `policy_filter_ids=[pair.policy_id]` so retrieval is scoped to the policy under test. Grades each reply twice: regex hard-facts plus `JUDGE_CHAIN`. The primary judge is NIM Mistral Large 3 675B, a different family from the Gemini brain, so grading is non-circular; the chain falls through to NIM Llama-4 Maverick → OpenRouter Qwen 80B `:free` → NIM Nemotron 49B at the tail. Emits `brain_used` values such as `sales_brain::continue`, `sales_brain::complete`, `brain_main::<elected_model>`, and `brain_main::graceful_error`. |
| `results.json` | Machine-readable results of the last run. |
| `results.md` | Human-readable report of the last run: per-question pass/fail, per-category rollup, hallucination breakdown. |
| `info_source_map.json` | Generated by `tools/info_source_map.py`. Claim → URL → verdict (✅ 798 / ⚠️ 321 / ❌ 0 / ⏳ 1385 as of 2026-05-14). The canonical source-grounding KPI. |
| `verified_urls.json` | HEAD-check verdict on every URL in the corpus / facts. Generated by `tools/verify_urls.py`. |
| `reviews_url_verification.json` | URL-validation output for `40-data/reviews/<insurer>.json`. |
| `chunk_sweep_results.json`, `chunk_diagnostic.json` | Outputs of `tools/chunk_sweep.py`: the chunk-size / overlap grid. See [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md). |
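As a rough illustration of the `gold_qa.json` entry shape listed above, the sketch below checks that an entry carries the six documented keys and that its `expected_regex` compiles. Only the key names and the policy id format come from this README; the helper name `validate_entry` and the sample values are hypothetical, and this is not a check that `generate_gold.py` is known to perform.

```python
import re

# Key names taken from the `gold_qa.json` row in the Files table.
REQUIRED_KEYS = {
    "policy_id", "question", "expected_answer",
    "expected_regex", "source_quote", "source_pdf",
}


def validate_entry(entry: dict) -> list:
    """Return a list of problems found in one gold-QA entry (empty if none)."""
    missing = sorted(REQUIRED_KEYS - set(entry))
    problems = [f"missing key: {key}" for key in missing]
    pattern = entry.get("expected_regex")
    if pattern is not None:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"expected_regex does not compile: {exc}")
    return problems


# Hypothetical sample entry; values are illustrative, not from the real gold set.
sample = {
    "policy_id": "hdfc-ergo__optima-secure",
    "question": "What is the initial waiting period?",
    "expected_answer": "30 days",
    "expected_regex": r"\b30\s*days\b",
    "source_quote": "illustrative quote",
    "source_pdf": "illustrative.pdf",
}
print(validate_entry(sample))  # prints []
```

A check like this could run right after regenerating the gold set, so that a malformed entry fails early rather than surfacing as a spurious `regex_pass = false` during a full run.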

## Usage

```bash
# Full 96-Q eval
python -m eval.run

# Smoke
python -m eval.run --limit 30

# Scoped to one policy
python -m eval.run --policy hdfc-ergo__optima-secure

# Regenerate the gold set after curation refresh
python -m eval.generate_gold
```
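To make the `--limit` and `--policy` flags concrete, here is a minimal sketch of how a runner could narrow the gold set before grading. It is an assumption about behaviour, not the code in `run.py`: `select_pairs` and `parse_args` are hypothetical helpers, and the order in which the real runner applies the policy filter and the limit is not stated in this README.

```python
import argparse
from typing import Optional


def select_pairs(pairs: list, policy: Optional[str] = None,
                 limit: Optional[int] = None) -> list:
    """Filter gold-QA pairs by policy id, then keep at most `limit` of them."""
    if policy is not None:
        pairs = [p for p in pairs if p["policy_id"] == policy]
    if limit is not None:
        pairs = pairs[:limit]
    return pairs


def parse_args(argv: Optional[list] = None) -> argparse.Namespace:
    """Parse the two flags shown in the usage block (assumed types)."""
    parser = argparse.ArgumentParser(description="Gold-QA eval (sketch)")
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--policy", type=str, default=None)
    return parser.parse_args(argv)


# Illustrative data: three pairs for one policy and one for another.
args = parse_args(["--policy", "hdfc-ergo__optima-secure", "--limit", "2"])
demo = [{"policy_id": "hdfc-ergo__optima-secure", "question": f"q{i}"}
        for i in range(3)]
demo.append({"policy_id": "other__policy", "question": "q3"})
print(len(select_pairs(demo, args.policy, args.limit)))  # prints 2
```

Filtering by policy before truncating means `--limit` counts questions within the selected policy; the opposite order would instead take the first N questions overall and might leave none for the policy of interest.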

## Grading

| Layer | What it checks | Output field |
| --- | --- | --- |
| Regex hard-facts | The expected numeric / boolean / waiting-period value appears verbatim in `reply_text`. | `regex_pass` |
| LLM judge (`JUDGE_CHAIN`: NIM Mistral Large 3 675B primary; NIM Llama-4 Maverick / OpenRouter Qwen 80B `:free` / NIM Nemotron 49B fallbacks per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)) | The reply semantically answers the question and cites the expected source. The judge is from a different family than the Gemini brain, so grading is non-circular. | `judge_pass` |
| Faithfulness gate | The 4-gate guard in `backend/faithfulness.py` already ran during `handle_turn`; the eval records whether it blocked the reply. | `blocked` |
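The three layers in the table can be sketched as a single grading function. This is an illustration under stated assumptions, not the implementation in `run.py`: `grade_reply`, `run_judge_chain`, the `Judge` signature, and the two stub judges are hypothetical, and the real `JUDGE_CHAIN` calls hosted models rather than local functions.

```python
import re
from typing import Callable, Optional

# Hypothetical judge signature: (question, reply_text, expected_answer) -> verdict.
# A judge raises if its provider is unavailable.
Judge = Callable[[str, str, str], bool]


def run_judge_chain(chain: list, question: str, reply_text: str,
                    expected_answer: str) -> Optional[bool]:
    """Try each judge in order; return the first verdict, or None if all fail."""
    for judge in chain:
        try:
            return judge(question, reply_text, expected_answer)
        except Exception:
            continue  # fall through to the next judge in the chain
    return None


def grade_reply(pair: dict, reply_text: str, blocked: bool, chain: list) -> dict:
    """Produce the three output fields named in the grading table."""
    regex_pass = re.search(pair["expected_regex"], reply_text) is not None
    judge_pass = run_judge_chain(chain, pair["question"], reply_text,
                                 pair["expected_answer"])
    return {"regex_pass": regex_pass, "judge_pass": judge_pass, "blocked": blocked}


# Stub judges standing in for real model calls.
def unavailable_judge(question: str, reply_text: str, expected: str) -> bool:
    raise RuntimeError("provider down")


def substring_judge(question: str, reply_text: str, expected: str) -> bool:
    return expected.lower() in reply_text.lower()


pair = {
    "question": "What is the initial waiting period?",  # illustrative
    "expected_answer": "30 days",
    "expected_regex": r"\b30\s*days\b",
}
result = grade_reply(pair, "The initial waiting period is 30 days.", False,
                     [unavailable_judge, substring_judge])
print(result)  # prints {'regex_pass': True, 'judge_pass': True, 'blocked': False}
```

Keeping the regex verdict and the judge verdict as separate fields lets a report distinguish a reply that states the right number in the wrong context from one that answers correctly but paraphrases the figure.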

## Acceptance bar

Per D-003 in `80-audit/ENTERPRISE_AUDIT.md`, a pass rate of ≥ 90% on the 96-question gold-QA set is the P0 gate for enterprise deployment. The current state lives in `results.md`.
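For a concrete sense of the bar: 90% of 96 is 86.4, so at least 87 questions must pass. The sketch below assumes the rate is computed as passes over total questions; which of `regex_pass` and `judge_pass` counts as a pass is defined by the audit document, not here, and `meets_bar` is a hypothetical helper.

```python
import math

TOTAL_QUESTIONS = 96  # size of the gold set, from this README
THRESHOLD = 0.90      # acceptance bar from D-003

# Smallest whole number of passing questions that reaches the threshold.
min_passes = math.ceil(TOTAL_QUESTIONS * THRESHOLD)


def meets_bar(passes: int, total: int = TOTAL_QUESTIONS,
              threshold: float = THRESHOLD) -> bool:
    """True when the pass rate is at or above the threshold."""
    return passes / total >= threshold


print(min_passes)     # prints 87
print(meets_bar(86))  # prints False
```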

## Related

- [`backend/faithfulness.py`](../backend/faithfulness.py): the 4-gate guard that produces the `blocked` verdict
- [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md) § "Live bot replies": provenance trail for any flagged reply
- `tools/info_source_map.py`: generator of `info_source_map.json`
- [ADR-014](../70-docs/60-decisions/ADR-014-groq-llama-grader.md) (superseded, but the cross-family judge principle survives: the current judge is NIM Mistral Large 3 675B and the brain is Gemini)
- [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md): current chain shape (Google → NIM → OpenRouter)