
eval/: Gold-QA harness

Numerical accuracy and grounding eval. It walks a fixed list of curated questions through the orchestrator in-process (fast, no HTTP) and grades each reply with both a regex hard-facts checker and an LLM judge from a different model family than the brain.

eval/ is the objective quality bar; tools/audit/ is the behavioural stress test. Both feed the readiness register in 80-audit/ENTERPRISE_AUDIT.md.

Files

| File | Role |
| --- | --- |
| `generate_gold.py` | Builds `gold_qa.json` from `40-data/policy_facts/`: for each curated field with a verbatim quote, emits a natural-language question + expected answer + expected citation. |
| `gold_qa.json` | 96-Q gold set. Each entry: `{policy_id, question, expected_answer, expected_regex, source_quote, source_pdf}`. |
| `run.py` | Runner. For each pair, calls `backend.orchestrator.handle_turn` with `policy_filter_ids=[pair.policy_id]` so retrieval is scoped to the policy under test. Grades each reply twice: regex hard-facts + `JUDGE_CHAIN` (NIM Mistral Large 3 675B primary; different family from the Gemini brain → non-circular). Falls through to NIM Llama-4 Maverick → OpenRouter Qwen 80B :free → NIM Nemotron 49B at the tail. Emits `brain_used` like `sales_brain::continue` / `sales_brain::complete` / `brain_main::<elected_model>` / `brain_main::graceful_error`. |
| `results.json` | Machine-readable last-run results. |
| `results.md` | Human-readable last-run report: per-question pass/fail, per-category rollup, hallucination breakdown. |
| `info_source_map.json` | Generated by `tools/info_source_map.py`. Claim → URL → verdict (✅ 798 / ⚠️ 321 / ❌ 0 / ⏳ 1385 as of 2026-05-14). The canonical source-grounding KPI. |
| `verified_urls.json` | HEAD-check verdict on every URL in the corpus / facts. Generated by `tools/verify_urls.py`. |
| `reviews_url_verification.json` | URL-validation output for `40-data/reviews/<insurer>.json`. |
| `chunk_sweep_results.json`, `chunk_diagnostic.json` | Outputs of `tools/chunk_sweep.py`: chunk-size / overlap grid. See ADR-018. |
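To make the gold_qa.json entry shape concrete, here is a minimal sketch of a completeness check for a single entry. The field names come from the description above; the `missing_fields` helper and the sample values are hypothetical and are not taken from the repo.

```python
# Field names as listed for gold_qa.json entries.
REQUIRED_FIELDS = (
    "policy_id",
    "question",
    "expected_answer",
    "expected_regex",
    "source_quote",
    "source_pdf",
)


def missing_fields(entry: dict) -> list[str]:
    """Return the required fields that are absent or empty in one gold entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


# Hypothetical entry: every value below is invented for illustration.
sample = {
    "policy_id": "example-insurer__example-plan",
    "question": "What is the initial waiting period?",
    "expected_answer": "30 days",
    "expected_regex": r"30\s*days",
    "source_quote": "A waiting period of 30 days applies.",
    "source_pdf": "example-plan.pdf",
}

# The same entry with its citation dropped, to show a failing case.
broken = {k: v for k, v in sample.items() if k != "source_pdf"}

print(missing_fields(sample))  # []
print(missing_fields(broken))  # ['source_pdf']
```

A check along these lines could run after regenerating the gold set, so that an entry without a source quote or PDF is caught before it reaches the runner.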

Usage

```bash
# Full 96-Q eval
python -m eval.run

# Smoke
python -m eval.run --limit 30

# Scoped to one policy
python -m eval.run --policy hdfc-ergo__optima-secure

# Regenerate the gold set after curation refresh
python -m eval.generate_gold
```
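As a rough illustration of what the `--policy` and `--limit` flags select, the sketch below filters a list of gold entries. It assumes the policy filter is applied before the limit; `select_pairs` is a hypothetical helper, not the actual code in run.py, and the entries are invented.

```python
from typing import Optional


def select_pairs(
    pairs: list[dict],
    policy: Optional[str] = None,
    limit: Optional[int] = None,
) -> list[dict]:
    """Keep entries for one policy (if given), then cap the count (if given)."""
    if policy is not None:
        pairs = [p for p in pairs if p["policy_id"] == policy]
    if limit is not None:
        pairs = pairs[:limit]
    return pairs


# Invented entries; only policy_id matters for this sketch.
gold = [
    {"policy_id": "insurer-a__plan-1", "question": "Q1"},
    {"policy_id": "insurer-b__plan-2", "question": "Q2"},
    {"policy_id": "insurer-a__plan-1", "question": "Q3"},
]

print(len(select_pairs(gold)))                              # 3
print(len(select_pairs(gold, policy="insurer-a__plan-1")))  # 2
print(len(select_pairs(gold, limit=1)))                     # 1
```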

Grading

| Layer | What it checks | Output field |
| --- | --- | --- |
| Regex hard-facts | Expected numeric / boolean / waiting-period value appears verbatim in `reply_text`. | `regex_pass` |
| LLM judge (`JUDGE_CHAIN`: NIM Mistral Large 3 675B primary; NIM Llama-4 Maverick / OpenRouter Qwen 80B :free / NIM Nemotron 49B fallbacks per ADR-040) | Reply semantically answers the question and cites the expected source. Different family from the Gemini brain → non-circular. | `judge_pass` |
| Faithfulness gate | The 4-gate guard in `backend/faithfulness.py` already ran during `handle_turn`; eval records whether it blocked. | `blocked` |
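The first two layers can be sketched as follows. This is an illustration under assumptions, not the code in run.py: it assumes the regex is applied with `re.search` (case-insensitively) and that the judge chain moves on to the next judge whenever one raises. The judge functions here are local stand-ins that make no model calls.

```python
import re
from typing import Callable, Optional, Sequence

# A judge takes (question, reply_text) and returns a verdict, or raises on failure.
Judge = Callable[[str, str], bool]


def regex_pass(expected_regex: str, reply_text: str) -> bool:
    """Hard-facts layer: the expected value must appear somewhere in the reply."""
    return re.search(expected_regex, reply_text, flags=re.IGNORECASE) is not None


def judge_pass(judges: Sequence[Judge], question: str, reply_text: str) -> Optional[bool]:
    """Try each judge in order and fall through to the next one on any error."""
    for judge in judges:
        try:
            return judge(question, reply_text)
        except Exception:
            continue  # e.g. timeout or rate limit: try the next judge
    return None  # every judge failed, so there is no verdict


# Stand-in judges (hypothetical): the first simulates an outage.
def unavailable_judge(question: str, reply_text: str) -> bool:
    raise TimeoutError("primary judge unavailable")


def keyword_judge(question: str, reply_text: str) -> bool:
    return "waiting period" in reply_text.lower()


reply = "The initial waiting period is 30 days."
print(regex_pass(r"30\s*days", reply))  # True
print(judge_pass([unavailable_judge, keyword_judge], "What is the waiting period?", reply))  # True
```

Returning `None` when the whole chain fails keeps "no verdict" distinct from a failing verdict, which is one way to stop a provider outage from being counted as a wrong answer.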

Acceptance bar

Per 80-audit/ENTERPRISE_AUDIT.md D-003: ≥ 90% on the 96-Q gold-QA set is the P0 gate for enterprise deployment. Current state lives in results.md.
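As a worked example of the threshold: 90% of 96 questions is 86.4, so at least 87 must pass. The sketch below checks that arithmetic. It assumes a question counts as passed only when both `regex_pass` and `judge_pass` are true; that pass criterion is an assumption made here, since the exact rule lives in run.py and results.md.

```python
def pass_rate(results: list[dict]) -> float:
    """Fraction of results where both grading layers passed (assumed criterion)."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["regex_pass"] and r["judge_pass"])
    return passed / len(results)


def meets_bar(results: list[dict], threshold: float = 0.90) -> bool:
    """True when the pass rate reaches the acceptance threshold."""
    return pass_rate(results) >= threshold


# Invented results for illustration: 87 passes out of 96, then 86 out of 96.
ok = {"regex_pass": True, "judge_pass": True}
bad = {"regex_pass": True, "judge_pass": False}

print(meets_bar([ok] * 87 + [bad] * 9))   # True  (87/96 = 0.90625)
print(meets_bar([ok] * 86 + [bad] * 10))  # False (86/96 is about 0.896)
```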

Related

  • backend/faithfulness.py: the 4-gate guard that produces the blocked verdict
  • kb/AUDIT_TRAIL.md § "Live bot replies": provenance trail for any flagged reply
  • tools/info_source_map.py: generator of info_source_map.json
  • ADR-014 (superseded, but the cross-family judge principle survives; current judge is NIM Mistral 675B, brain is Gemini)
  • ADR-040: current chain shape (Google → NIM → OpenRouter)