Spaces:
Sleeping
Sleeping
docs: KI-071 β reconcile docs with code reality (Qwen 80B / Nemotron 30B / Mistral 675B as current primaries)
2e60476 Eval β Gold Q&A + Run History
Auto-generated. Source: eval/gold_qa.json + eval/results.json + eval/run.py.
Gold Q&A composition β 96 pairs total
| Type | Count |
|---|---|
waiting_period |
27 |
coverage_scope |
21 |
exclusions_oos |
20 |
sub_limit |
12 |
regulatory_oos |
10 |
bonus |
6 |
Refusal-test questions: 30 (these test the bot correctly refuses out-of-corpus questions)
Most recent eval run
- Ran: 2026-05-12T22:30:15Z
- Questions: 25
- Factual accuracy: 40.0%
- Citation accuracy: 50.0%
- Refusal precision: 44.4%
- Blocked by faithfulness: 12
Methodology
- Gold Q&A built by 3 pipelines: auto-from-extraction (templated), LLM-drafted (human-verified), hand-crafted adversarial. See
70-docs/03-eval-plan.md. - Grader: NIM Mistral Large 3 675B (Mistral; primary judge per D-022) β different family from the Qwen 3-Next 80B brain β non-circular (D-019, 2026-05-14). The earlier Groq Llama grader was retired in the same consolidation.
- Re-run:
python -m eval.run [--limit N] [--policy <id>]. - CI gate:
.github/workflows/eval.ymlruns eval on every PR; blocks merge if factual_accuracy < 0.65 or citation_accuracy < 0.55.