docs: KI-071 - reconcile docs with code reality (Qwen 80B / Nemotron 30B / Mistral 675B as current primaries)
# Eval: Gold Q&A + Run History

_Auto-generated. Source: `eval/gold_qa.json` + `eval/results.json` + `eval/run.py`._

## Gold Q&A composition: 96 pairs total

| Type | Count |
| --- | --- |
| `waiting_period` | 27 |
| `coverage_scope` | 21 |
| `exclusions_oos` | 20 |
| `sub_limit` | 12 |
| `regulatory_oos` | 10 |
| `bonus` | 6 |

**Refusal-test questions:** 30 (these test that the bot correctly refuses out-of-corpus questions)
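
The composition figures can be cross-checked with a short tally. This is a minimal sketch, not the project's generator: it assumes each gold pair is a dict with a `type` key and that types ending in `_oos` are the refusal tests. Both are assumptions about `eval/gold_qa.json`, inferred only from the counts in the table (20 + 10 = 30).

```python
from collections import Counter


def summarize_gold(pairs):
    """Tally gold Q&A pairs by type and count refusal tests.

    Assumption: each pair is a dict with a "type" key, and types ending
    in "_oos" are the refusal (out-of-corpus) tests.
    """
    by_type = Counter(p["type"] for p in pairs)
    refusal = sum(n for t, n in by_type.items() if t.endswith("_oos"))
    return by_type, refusal


# Rebuild stand-in pairs from the counts reported in the table.
counts = {
    "waiting_period": 27,
    "coverage_scope": 21,
    "exclusions_oos": 20,
    "sub_limit": 12,
    "regulatory_oos": 10,
    "bonus": 6,
}
pairs = [{"type": t} for t, n in counts.items() for _ in range(n)]

by_type, refusal = summarize_gold(pairs)
print(sum(by_type.values()), refusal)  # 96 30
```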

## Most recent eval run

- Ran: 2026-05-12T22:30:15Z
- Questions: 25
- Factual accuracy: **40.0%**
- Citation accuracy: **50.0%**
- Refusal precision: **44.4%**
- Blocked by faithfulness: 12
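
The report does not spell out how refusal precision is computed. Under the usual definition of precision (of the questions the bot refused, the fraction it should have refused), it could be sketched as follows. The `(refused, should_refuse)` record shape is hypothetical and is not the schema of `eval/results.json`.

```python
def refusal_precision(records):
    """Fraction of the bot's refusals that were correct.

    Assumption: each record is a (refused, should_refuse) pair of booleans.
    Returns None when the bot never refused, since precision is undefined.
    """
    # Keep the expected label only for questions the bot actually refused.
    refused = [should for did, should in records if did]
    if not refused:
        return None
    return sum(refused) / len(refused)


# Illustrative records only, not data from an actual run.
records = [(True, True), (True, False), (False, True), (False, False)]
print(refusal_precision(records))  # 0.5
```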

## Methodology

- Gold Q&A is built by three pipelines: auto-from-extraction (templated), LLM-drafted (human-verified), and hand-crafted adversarial. See `70-docs/03-eval-plan.md`.
- Grader: NIM Mistral Large 3 675B (Mistral; primary judge per D-022). It comes from a different model family than the Qwen 3-Next 80B brain, so grading is non-circular (D-019, 2026-05-14). The earlier Groq Llama grader was retired in the same consolidation.
- Re-run: `python -m eval.run [--limit N] [--policy <id>]`.
- CI gate: `.github/workflows/eval.yml` runs the eval on every PR and blocks merge if `factual_accuracy < 0.65` or `citation_accuracy < 0.55`.
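
The CI gate's threshold check can be illustrated with a small function. This is a sketch under assumptions, not the contents of `.github/workflows/eval.yml`: the metric names and thresholds come from the CI gate bullet, while the `gate_failures` helper and the idea that metrics arrive as fractions in a dict are hypothetical.

```python
# Thresholds as stated in the CI gate description.
THRESHOLDS = {"factual_accuracy": 0.65, "citation_accuracy": 0.55}


def gate_failures(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that fall below their threshold.

    Assumption: metrics maps metric name -> fraction in [0, 1].
    """
    return [name for name, floor in thresholds.items() if metrics[name] < floor]


# Figures from the most recent run reported above (40.0% and 50.0%).
latest = {"factual_accuracy": 0.400, "citation_accuracy": 0.500}
failures = gate_failures(latest)
print(failures)  # ['factual_accuracy', 'citation_accuracy']
```

A CI step could exit with a non-zero status whenever the returned list is non-empty, which is one common way to block a merge.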