# `eval/`: Gold-QA harness

Numerical accuracy and grounding eval. Walks a fixed list of curated questions through the orchestrator in-process (fast, no HTTP) and grades each reply with both a regex hard-facts checker and an LLM judge from a different model family than the brain.

`eval/` is the **objective** quality bar; `tools/audit/` is the **behavioural** stress test. Both feed the readiness register in `80-audit/ENTERPRISE_AUDIT.md`.

## Files

| File | Role |
| --- | --- |
| `generate_gold.py` | Builds `gold_qa.json` from `40-data/policy_facts/`: for each curated field with a verbatim quote, emits a natural-language question + expected answer + expected citation. |
| `gold_qa.json` | 96-Q gold set. Each entry: `{policy_id, question, expected_answer, expected_regex, source_quote, source_pdf}`. |
| `run.py` | Runner. For each pair, calls `backend.orchestrator.handle_turn` with `policy_filter_ids=[pair.policy_id]` so retrieval is scoped to the policy under test. Grades each reply twice: regex hard-facts plus `JUDGE_CHAIN` (NIM Mistral Large 3 675B primary, a different family from the Gemini brain, so grading is non-circular). The chain falls through to NIM Llama-4 Maverick → OpenRouter Qwen 80B `:free` → NIM Nemotron 49B at the tail. Emits `brain_used` values such as `sales_brain::continue`, `sales_brain::complete`, `brain_main::<elected_model>`, or `brain_main::graceful_error`. |
| `results.json` | Machine-readable last-run results. |
| `results.md` | Human-readable last-run report: per-question pass/fail, per-category rollup, hallucination breakdown. |
| `info_source_map.json` | Generated by `tools/info_source_map.py`. Claim → URL → verdict (✅ 798 / ⚠️ 321 / ❌ 0 / ⏳ 1385 as of 2026-05-14). The canonical source-grounding KPI. |
| `verified_urls.json` | HEAD-check verdict on every URL in the corpus / facts. Generated by `tools/verify_urls.py`. |
| `reviews_url_verification.json` | URL-validation output for `40-data/reviews/<insurer>.json`. |
| `chunk_sweep_results.json`, `chunk_diagnostic.json` | Outputs of `tools/chunk_sweep.py`: the chunk-size / overlap grid. See [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md). |

## Usage

```bash
# Full 96-Q eval
python -m eval.run
# Smoke
python -m eval.run --limit 30
# Scoped to one policy
python -m eval.run --policy hdfc-ergo__optima-secure
# Regenerate the gold set after curation refresh
python -m eval.generate_gold
```
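
As a rough illustration of what `--policy` and `--limit` narrow down, the sketch below filters a list of gold entries. It assumes `gold_qa.json` is a top-level list of objects with the fields listed under Files. `select_pairs` is a hypothetical helper, not a function from `run.py`, and the demo entries are placeholders rather than real gold data.

```python
from typing import Optional

# Field names taken from the gold_qa.json description in the Files table.
REQUIRED_FIELDS = {
    "policy_id", "question", "expected_answer",
    "expected_regex", "source_quote", "source_pdf",
}


def select_pairs(entries: list[dict], policy: Optional[str] = None,
                 limit: Optional[int] = None) -> list[dict]:
    """Validate entries, then apply an optional policy filter and an optional cap."""
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
    if policy is not None:
        entries = [e for e in entries if e["policy_id"] == policy]
    return entries if limit is None else entries[:limit]


def _placeholder(policy_id: str, n: int) -> dict:
    # Builds an obviously fake entry; none of these values come from the real gold set.
    return {
        "policy_id": policy_id,
        "question": f"placeholder question {n}",
        "expected_answer": f"placeholder answer {n}",
        "expected_regex": r"placeholder",
        "source_quote": "placeholder quote",
        "source_pdf": "placeholder.pdf",
    }


demo = [
    _placeholder("hdfc-ergo__optima-secure", 1),
    _placeholder("example__other-policy", 2),
    _placeholder("hdfc-ergo__optima-secure", 3),
]

scoped = select_pairs(demo, policy="hdfc-ergo__optima-secure")
print([e["question"] for e in scoped])  # ['placeholder question 1', 'placeholder question 3']
print(len(select_pairs(demo, limit=1)))  # 1
```
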

## Grading

| Layer | What it checks | Output field |
| --- | --- | --- |
| Regex hard-facts | Expected numeric / boolean / waiting-period value appears verbatim in `reply_text`. | `regex_pass` |
| LLM judge (`JUDGE_CHAIN`: NIM Mistral Large 3 675B primary; NIM Llama-4 Maverick / OR Qwen 80B `:free` / NIM Nemotron 49B fallbacks per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)) | Reply *semantically* answers the question and cites the expected source. A different family from the Gemini brain, so grading is non-circular. | `judge_pass` |
| Faithfulness gate | The 4-gate guard in `backend/faithfulness.py` already ran during `handle_turn`; eval records whether it blocked. | `blocked` |
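
A minimal sketch of the regex layer and of a per-reply record holding the three output fields follows. `Grade` and `regex_hard_facts` are hypothetical names, the case-insensitive search is an assumption, and the sample replies and pattern are invented for illustration rather than taken from any policy. How `run.py` combines the three fields into an overall pass is not stated here, so the sketch only records them.

```python
import re
from dataclasses import dataclass


@dataclass
class Grade:
    # One field per grading layer, named after the "Output field" column.
    regex_pass: bool
    judge_pass: bool
    blocked: bool


def regex_hard_facts(reply_text: str, expected_regex: str) -> bool:
    """True when the expected value pattern occurs anywhere in the reply."""
    return re.search(expected_regex, reply_text, flags=re.IGNORECASE) is not None


# Invented example: the pattern looks for a literal "36 months".
pattern = r"\b36\s*months\b"
verbatim_reply = "The waiting period is 36 months."
paraphrased_reply = "The waiting period is three years."

grade = Grade(
    regex_pass=regex_hard_facts(verbatim_reply, pattern),
    judge_pass=True,   # placeholder: would come from the LLM judge
    blocked=False,     # placeholder: would come from the faithfulness gate
)
print(grade.regex_pass)                              # True
print(regex_hard_facts(paraphrased_reply, pattern))  # False
```

The paraphrased reply shows why a verbatim check alone is strict: a reply can state the same fact in other words and still fail the regex layer.
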

## Acceptance bar

Per `80-audit/ENTERPRISE_AUDIT.md` D-003: **≥ 90% on the 96-Q gold-QA set is the P0 gate for enterprise deployment.** Current state lives in `results.md`.
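
As a worked example of the arithmetic, 90% of 96 is 86.4, so at least 87 questions must pass. The sketch below uses integer arithmetic to avoid floating-point rounding at the boundary. `min_passes` and `meets_bar` are hypothetical helpers, and which verdict counts as a "pass" (regex, judge, or both) is left to the caller because it is not specified here.

```python
def min_passes(total: int, bar_percent: int = 90) -> int:
    """Smallest pass count that reaches bar_percent of total (integer ceiling division)."""
    return (total * bar_percent + 99) // 100


def meets_bar(passed: int, total: int, bar_percent: int = 90) -> bool:
    """True when passed / total >= bar_percent / 100, compared without floats."""
    return passed * 100 >= bar_percent * total


print(min_passes(96))     # 87
print(meets_bar(86, 96))  # False
print(meets_bar(87, 96))  # True
```
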

## Related

- [`backend/faithfulness.py`](../backend/faithfulness.py): the 4-gate guard that produces the `blocked` verdict
- [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md) § "Live bot replies": provenance trail for any flagged reply
- `tools/info_source_map.py`: generator of `info_source_map.json`
- [ADR-014](../70-docs/60-decisions/ADR-014-groq-llama-grader.md) (superseded, but the cross-family judge principle survives; current judge is NIM Mistral Large 3 675B, brain is Gemini)
- [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md): current chain shape (Google → NIM → OpenRouter)