
Eval: Gold Q&A + Run History

Auto-generated. Source: eval/gold_qa.json + eval/results.json + eval/run.py.

Gold Q&A composition: 96 pairs total

| Type | Count |
|---|---|
| waiting_period | 27 |
| coverage_scope | 21 |
| exclusions_oos | 20 |
| sub_limit | 12 |
| regulatory_oos | 10 |
| bonus | 6 |

Refusal-test questions: 30 (these test that the bot correctly refuses out-of-corpus questions)
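The composition figures above can be cross-checked with a few lines of Python. This is a minimal sketch, not part of eval/run.py: the counts are copied by hand from the table, the helper names `total_pairs` and `refusal_pairs` are hypothetical, and it assumes that the refusal-test questions are exactly the types whose names end in `_oos` (an assumption that happens to match the stated total of 30).

```python
from collections import Counter

# Counts copied by hand from the composition table above.
TYPE_COUNTS = Counter({
    "waiting_period": 27,
    "coverage_scope": 21,
    "exclusions_oos": 20,
    "sub_limit": 12,
    "regulatory_oos": 10,
    "bonus": 6,
})


def total_pairs(counts):
    """Sum the per-type counts to get the size of the gold set."""
    return sum(counts.values())


def refusal_pairs(counts, suffix="_oos"):
    """Count refusal-test pairs.

    Assumption: a type is a refusal test if its name ends in "_oos".
    """
    return sum(n for name, n in counts.items() if name.endswith(suffix))


print("total pairs:", total_pairs(TYPE_COUNTS))
print("refusal-test pairs:", refusal_pairs(TYPE_COUNTS))
```

Under that assumption the two sums agree with the 96 pairs and 30 refusal-test questions reported here; if the real gold file tags refusal questions differently, only `refusal_pairs` needs to change.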

Most recent eval run

  • Ran: 2026-05-12T22:30:15Z
  • Questions: 25
  • Factual accuracy: 40.0%
  • Citation accuracy: 50.0%
  • Refusal precision: 44.4%
  • Blocked by faithfulness: 12

Methodology

  • Gold Q&A built by 3 pipelines: auto-from-extraction (templated), LLM-drafted (human-verified), hand-crafted adversarial. See 70-docs/03-eval-plan.md.
  • Grader: NIM Mistral Large 3 675B (Mistral; primary judge per D-022), a different model family from the Qwen 3-Next 80B brain, so grading is non-circular (D-019, 2026-05-14). The earlier Groq Llama grader was retired in the same consolidation.
  • Re-run: python -m eval.run [--limit N] [--policy <id>].
  • CI gate: .github/workflows/eval.yml runs eval on every PR; blocks merge if factual_accuracy < 0.65 or citation_accuracy < 0.55.
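The CI gate rule can be illustrated with a small threshold check. This is a sketch under stated assumptions, not the contents of .github/workflows/eval.yml or eval/run.py: the function name `gate_failures` is hypothetical, and it assumes the metrics arrive as a flat mapping of fractions keyed by `factual_accuracy` and `citation_accuracy` (the actual layout of eval/results.json is not shown here). Only the two thresholds and the strict less-than comparison come from the description above.

```python
# Thresholds quoted from the CI gate description above.
THRESHOLDS = {"factual_accuracy": 0.65, "citation_accuracy": 0.55}


def gate_failures(metrics, thresholds=THRESHOLDS):
    """Return one message per metric that is strictly below its threshold.

    An empty list means the gate would let the merge through.
    Assumption: `metrics` maps metric names to fractions in [0, 1].
    """
    failures = []
    for name, minimum in thresholds.items():
        value = metrics[name]
        if value < minimum:
            failures.append(f"{name} {value:.3f} < {minimum:.2f}")
    return failures


# Figures from the most recent eval run listed above, as fractions.
latest = {"factual_accuracy": 0.400, "citation_accuracy": 0.500}
for message in gate_failures(latest):
    print("BLOCK:", message)
```

Fed the figures from the most recent run, the sketch reports both metrics as below threshold. A value exactly equal to a threshold passes, because the rule is written with `<` rather than `<=`.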