# `eval/`: Gold-QA harness

Numerical accuracy + grounding eval. Walks a fixed list of curated questions through the orchestrator in-process (fast, no HTTP) and grades each reply with both a regex hard-facts checker and an LLM-judge of a different family from the brain.

`eval/` is the **objective** quality bar; `tools/audit/` is the **behavioural** stress test. Both feed the readiness register in `80-audit/ENTERPRISE_AUDIT.md`.

## Files

| File | Role |
| --- | --- |
| `generate_gold.py` | Builds `gold_qa.json` from `40-data/policy_facts/`: for each curated field with a verbatim quote, emits a natural-language question + expected answer + expected citation. |
| `gold_qa.json` | 96-Q gold set. Each entry: `{policy_id, question, expected_answer, expected_regex, source_quote, source_pdf}`. |
| `run.py` | Runner. For each pair: calls `backend.orchestrator.handle_turn` with `policy_filter_ids=[pair.policy_id]` so retrieval is scoped to the policy under test. Grades each reply twice: regex hard-facts + `JUDGE_CHAIN` (NIM Mistral Large 3 675B primary; different family from the Gemini brain → non-circular). Falls through to NIM Llama-4 Maverick → OpenRouter Qwen 80B `:free` → NIM Nemotron 49B at the tail. Emits `brain_used` like `sales_brain::continue` / `sales_brain::complete` / `brain_main::` / `brain_main::graceful_error`. |
| `results.json` | Machine-readable last-run results. |
| `results.md` | Human-readable last-run report: per-question pass/fail, per-category rollup, hallucination breakdown. |
| `info_source_map.json` | Generated by `tools/info_source_map.py`. Claim → URL → verdict (✅ 798 / ⚠️ 321 / ❌ 0 / ⏳ 1385 as of 2026-05-14). The canonical source-grounding KPI. |
| `verified_urls.json` | HEAD-check verdict on every URL in the corpus / facts. Generated by `tools/verify_urls.py`. |
| `reviews_url_verification.json` | URL-validation output for `40-data/reviews/.json`. |
| `chunk_sweep_results.json`, `chunk_diagnostic.json` | Outputs of `tools/chunk_sweep.py` (chunk-size / overlap grid). See [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md). |

## Usage

```bash
# Full 96-Q eval
python -m eval.run

# Smoke
python -m eval.run --limit 30

# Scoped to one policy
python -m eval.run --policy hdfc-ergo__optima-secure

# Regenerate the gold set after curation refresh
python -m eval.generate_gold
```

## Grading

| Layer | What it checks | Output field |
| --- | --- | --- |
| Regex hard-facts | Expected numeric / boolean / waiting-period value appears verbatim in `reply_text`. | `regex_pass` |
| LLM judge (`JUDGE_CHAIN`: NIM Mistral Large 3 675B primary; NIM Llama-4 Maverick / OR Qwen 80B `:free` / NIM Nemotron 49B fallbacks per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)) | Reply *semantically* answers the question and cites the expected source. Different family from the Gemini brain → non-circular. | `judge_pass` |
| Faithfulness gate | The 4-gate guard in `backend/faithfulness.py` already ran during `handle_turn`; eval records whether it blocked. | `blocked` |

## Acceptance bar

Per `80-audit/ENTERPRISE_AUDIT.md` D-003: **≥ 90% on the 96-Q gold-QA set is the P0 gate for enterprise deployment.** Current state lives in `results.md`.

## Related

- [`backend/faithfulness.py`](../backend/faithfulness.py): the 4-gate guard that produces the `blocked` verdict
- [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md) § "Live bot replies": provenance trail for any flagged reply
- `tools/info_source_map.py`: generator of `info_source_map.json`
- [ADR-014](../70-docs/60-decisions/ADR-014-groq-llama-grader.md) (superseded, but the cross-family judge principle survives; current judge is NIM Mistral 675B, brain is Gemini)
- [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md): current chain shape (Google → NIM → OpenRouter)
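
## Sketch: regex layer and acceptance bar

To make the Grading and Acceptance bar sections more concrete, here is a minimal Python sketch of how a regex hard-facts check and the 90% gate could be computed from graded results. It is not the code in `run.py`. The names `Graded`, `regex_hard_facts`, `pass_rate`, and `meets_bar` are hypothetical. The rule that a question counts as passing only when both `regex_pass` and `judge_pass` are true and `blocked` is false is an assumption; the real runner may aggregate differently.

```python
import re
from dataclasses import dataclass

# From the acceptance bar: >= 90% on the gold-QA set.
PASS_THRESHOLD = 0.90


@dataclass
class Graded:
    """Hypothetical per-question record using the documented output fields."""
    regex_pass: bool
    judge_pass: bool
    blocked: bool


def regex_hard_facts(expected_regex: str, reply_text: str) -> bool:
    """Return True when the expected value pattern occurs in the reply.

    Assumption: matching is case-insensitive and uses re.search.
    """
    return re.search(expected_regex, reply_text, flags=re.IGNORECASE) is not None


def pass_rate(results: list[Graded]) -> float:
    """Fraction of questions that pass.

    Assumption: a pass requires both grading layers to pass and the
    faithfulness gate not to have blocked the reply.
    """
    if not results:
        return 0.0
    passed = sum(
        1 for r in results if r.regex_pass and r.judge_pass and not r.blocked
    )
    return passed / len(results)


def meets_bar(results: list[Graded]) -> bool:
    """Compare the pass rate against the 90% acceptance bar."""
    return pass_rate(results) >= PASS_THRESHOLD
```

For example, `regex_hard_facts(r"\b36\s*months\b", reply_text)` would accept a reply that states "36 months" and reject one that states "24 months". Treating a blocked reply as a failure is the conservative choice for a deployment gate; a different policy would only change the predicate inside `pass_rate`.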