# `eval/`: Gold-QA harness
Numerical accuracy + grounding eval. Walks a fixed list of curated questions through the orchestrator in-process (fast, no HTTP) and grades each reply with both a regex hard-facts checker and an LLM-judge of a different family from the brain.
`eval/` is the **objective** quality bar; `tools/audit/` is the **behavioural** stress test. Both feed the readiness register in `80-audit/ENTERPRISE_AUDIT.md`.
## Files
| File | Role |
| --- | --- |
| `generate_gold.py` | Builds `gold_qa.json` from `40-data/policy_facts/`: for each curated field with a verbatim quote, emits a natural-language question + expected answer + expected citation. |
| `gold_qa.json` | 96-Q gold set. Each entry: `{policy_id, question, expected_answer, expected_regex, source_quote, source_pdf}`. |
| `run.py` | Runner. For each pair, calls `backend.orchestrator.handle_turn` with `policy_filter_ids=[pair.policy_id]` so retrieval is scoped to the policy under test. Grades each reply twice: regex hard-facts + `JUDGE_CHAIN` (NIM Mistral Large 3 675B primary; different family from the Gemini brain → non-circular). Falls through to NIM Llama-4 Maverick → OpenRouter Qwen 80B `:free` → NIM Nemotron 49B at the tail. Emits `brain_used` like `sales_brain::continue` / `sales_brain::complete` / `brain_main::<elected_model>` / `brain_main::graceful_error`. |
| `results.json` | Machine-readable last-run results. |
| `results.md` | Human-readable last-run report: per-question pass/fail, per-category rollup, hallucination breakdown. |
| `info_source_map.json` | Generated by `tools/info_source_map.py`. Claim → URL → verdict (✅ 798 / ⚠️ 321 / ❌ 0 / ⏳ 1385 as of 2026-05-14). The canonical source-grounding KPI. |
| `verified_urls.json` | HEAD-check verdict on every URL in the corpus / facts. Generated by `tools/verify_urls.py`. |
| `reviews_url_verification.json` | URL-validation output for `40-data/reviews/<insurer>.json`. |
| `chunk_sweep_results.json`, `chunk_diagnostic.json` | Outputs of `tools/chunk_sweep.py`: chunk-size / overlap grid. See [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md). |
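
As a rough illustration of the `gold_qa.json` entry shape listed above, the sketch below checks that every entry carries the six documented fields. It assumes the file is a top-level JSON list and that all six fields are required, non-empty strings; neither is confirmed here, and `missing_fields` / `check_gold_set` are hypothetical helper names, not part of `generate_gold.py`.

```python
import json

# Field names come from the Files table; treating every one of them as a
# required, non-empty value is an assumption made for this sketch.
REQUIRED_FIELDS = (
    "policy_id", "question", "expected_answer",
    "expected_regex", "source_quote", "source_pdf",
)

def missing_fields(entry: dict) -> list[str]:
    """Return the required fields that are absent or empty in one gold entry."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]

def check_gold_set(path: str) -> dict[int, list[str]]:
    """Map entry index -> missing fields, only for entries that fail the check.

    Assumes the file holds a top-level JSON list of entry objects.
    """
    with open(path, encoding="utf-8") as fh:
        entries = json.load(fh)
    problems = {}
    for index, entry in enumerate(entries):
        missing = missing_fields(entry)
        if missing:
            problems[index] = missing
    return problems
```

Running something like `check_gold_set("eval/gold_qa.json")` after a curation refresh would surface entries that lost a field; an empty dict means every entry has all six.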
## Usage
```bash
# Full 96-Q eval
python -m eval.run
# Smoke
python -m eval.run --limit 30
# Scoped to one policy
python -m eval.run --policy hdfc-ergo__optima-secure
# Regenerate the gold set after curation refresh
python -m eval.generate_gold
```
## Grading
| Layer | What it checks | Output field |
| --- | --- | --- |
| Regex hard-facts | Expected numeric / boolean / waiting-period value appears verbatim in `reply_text`. | `regex_pass` |
| LLM judge (`JUDGE_CHAIN`: NIM Mistral Large 3 675B primary; NIM Llama-4 Maverick / OR Qwen 80B `:free` / NIM Nemotron 49B fallbacks per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)) | Reply *semantically* answers the question and cites the expected source. Different family from the Gemini brain → non-circular. | `judge_pass` |
| Faithfulness gate | The 4-gate guard in `backend/faithfulness.py` already ran during `handle_turn`; eval records whether it blocked. | `blocked` |
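
The three layers above could be combined roughly as follows. This is a minimal sketch, not the code in `run.py`: the `Judge` callable shape, the "raise on provider failure, then try the next judge" behaviour, and the helper names `judge_with_fallback` and `grade` are all assumptions made for illustration. Only the output field names (`regex_pass`, `judge_pass`, `blocked`) and the entry fields come from this README.

```python
import re
from typing import Callable, Optional, Sequence

def regex_pass(expected_regex: str, reply_text: str) -> bool:
    """True when the expected hard fact appears somewhere in the reply."""
    return re.search(expected_regex, reply_text) is not None

# Hypothetical judge interface: takes (question, reply_text), returns a
# verdict, and raises if its provider is unavailable.
Judge = Callable[[str, str], bool]

def judge_with_fallback(
    judges: Sequence[Judge], question: str, reply_text: str
) -> Optional[bool]:
    """Try each judge in order; return the first verdict, or None if all fail."""
    for judge in judges:
        try:
            return judge(question, reply_text)
        except Exception:
            continue  # fall through to the next judge in the chain
    return None

def grade(entry: dict, reply_text: str, blocked: bool,
          judges: Sequence[Judge]) -> dict:
    """Build one result record using the three documented output fields."""
    return {
        "regex_pass": regex_pass(entry["expected_regex"], reply_text),
        "judge_pass": judge_with_fallback(judges, entry["question"], reply_text),
        "blocked": blocked,  # recorded as-is from the faithfulness gate
    }
```

Keeping the regex layer independent of the judge layer means a provider outage degrades only `judge_pass` (here to `None`), while the hard-facts verdict is still recorded.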
## Acceptance bar
Per `80-audit/ENTERPRISE_AUDIT.md` D-003: **≥ 90% on the 96-Q gold-QA set is the P0 gate for enterprise deployment.** Current state lives in `results.md`.
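
For a sense of what the bar means in absolute terms, the sketch below works out the minimum number of passing questions and checks a list of result records against it. It assumes a question counts as passed only when both `regex_pass` and `judge_pass` are true, and that results arrive as a list of dicts; the actual pass rule and the layout of `results.json` are not specified in this README, and `min_passes` / `meets_bar` are hypothetical names.

```python
def min_passes(total: int = 96, pct: int = 90) -> int:
    """Smallest pass count that reaches pct percent of total (integer ceiling)."""
    return (total * pct + 99) // 100

def meets_bar(results: list[dict], pct: int = 90) -> bool:
    """True when the share of passing records is at least pct percent.

    Assumption: a record passes only if both regex_pass and judge_pass
    are truthy. An empty result list never meets the bar.
    """
    if not results:
        return False
    passed = sum(
        1 for record in results
        if record.get("regex_pass") and record.get("judge_pass")
    )
    # Integer comparison avoids floating-point rounding at the boundary.
    return passed * 100 >= pct * len(results)
```

Since 90% of 96 is 86.4, `min_passes()` returns 87: under this pass rule at most 9 questions may fail on a full run.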
## Related
- [`backend/faithfulness.py`](../backend/faithfulness.py): the 4-gate guard that produces the `blocked` verdict
- [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md) § "Live bot replies": provenance trail for any flagged reply
- `tools/info_source_map.py`: generator of `info_source_map.json`
- [ADR-014](../70-docs/60-decisions/ADR-014-groq-llama-grader.md) (superseded, but the cross-family judge principle survives; current judge is NIM Mistral 675B, brain is Gemini)
- [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md): current chain shape (Google → NIM → OpenRouter)