# `eval/` — Gold-QA harness

Numerical accuracy + grounding eval. Walks a fixed list of curated questions through the orchestrator in-process (fast, no HTTP) and grades each reply twice: once with a regex hard-facts checker and once with an LLM judge from a different model family than the brain.

`eval/` is the **objective** quality bar; `tools/audit/` is the **behavioural** stress test. Both feed the readiness register in `80-audit/ENTERPRISE_AUDIT.md`.

## Files

| File | Role |
| --- | --- |
| `generate_gold.py` | Builds `gold_qa.json` from `40-data/policy_facts/`: for each curated field with a verbatim quote, emits a natural-language question + expected answer + expected citation. |
| `gold_qa.json` | 96-Q gold set. Each entry: `{policy_id, question, expected_answer, expected_regex, source_quote, source_pdf}`. |
| `run.py` | Runner. For each gold Q/A pair: calls `backend.orchestrator.handle_turn` with `policy_filter_ids=[pair.policy_id]` so retrieval is scoped to the policy under test. Grades each reply twice: regex hard-facts, then `JUDGE_CHAIN` (NIM Mistral Large 3 675B primary; a different family from the Gemini brain, so the grading is non-circular). If the primary judge is unavailable, the chain falls through to NIM Llama-4 Maverick → OpenRouter Qwen 80B `:free` → NIM Nemotron 49B at the tail. Emits `brain_used` values such as `sales_brain::continue` / `sales_brain::complete` / `brain_main::<elected_model>` / `brain_main::graceful_error`. |
| `results.json` | Machine-readable last-run results. |
| `results.md` | Human-readable last-run report — per-question pass/fail, per-category rollup, hallucination breakdown. |
| `info_source_map.json` | Generated by `tools/info_source_map.py`. Claim → URL → verdict (✅ 798 / ⚠️ 321 / ❌ 0 / ⏳ 1385 as of 2026-05-14). The canonical source-grounding KPI. |
| `verified_urls.json` | HEAD-check verdict on every URL in the corpus / facts. Generated by `tools/verify_urls.py`. |
| `reviews_url_verification.json` | URL-validation output for `40-data/reviews/<insurer>.json`. |
| `chunk_sweep_results.json`, `chunk_diagnostic.json` | Outputs of `tools/chunk_sweep.py` — chunk-size / overlap grid. See [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md). |
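
As a rough illustration of the `gold_qa.json` entry shape listed above, the sketch below checks that an entry carries all six documented fields. The field names come from the table; everything else is an assumption made for illustration: the `str` type on every field, the `GoldEntry` and `missing_keys` names, and the placeholder values in `example_entry` (they are not taken from the real gold set).

```python
from typing import TypedDict


class GoldEntry(TypedDict):
    """One gold_qa.json entry. Typing every field as str is an assumption."""

    policy_id: str
    question: str
    expected_answer: str
    expected_regex: str
    source_quote: str
    source_pdf: str


# Field names as documented in the Files table.
REQUIRED_KEYS = frozenset(GoldEntry.__annotations__)


def missing_keys(entry: dict) -> set:
    """Return the documented field names that are absent from `entry`."""
    return set(REQUIRED_KEYS) - set(entry)


# Placeholder values only; none of these come from the real gold set.
example_entry = {
    "policy_id": "example-insurer__example-plan",
    "question": "What is the initial waiting period?",
    "expected_answer": "30 days",
    "expected_regex": r"\b30\s+days\b",
    "source_quote": "A waiting period of 30 days applies.",
    "source_pdf": "example-plan.pdf",
}

print(missing_keys(example_entry))
```

A check like this could run right after `python -m eval.generate_gold` to catch entries that lost a field during a curation refresh.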

## Usage

```bash
# Full 96-Q eval
python -m eval.run

# Smoke
python -m eval.run --limit 30

# Scoped to one policy
python -m eval.run --policy hdfc-ergo__optima-secure

# Regenerate the gold set after curation refresh
python -m eval.generate_gold
```

## Grading

| Layer | What it checks | Output field |
| --- | --- | --- |
| Regex hard-facts | Expected numeric / boolean / waiting-period value appears verbatim in `reply_text`. | `regex_pass` |
| LLM judge (`JUDGE_CHAIN`: NIM Mistral Large 3 675B primary; NIM Llama-4 Maverick / OpenRouter Qwen 80B `:free` / NIM Nemotron 49B fallbacks per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)) | Reply *semantically* answers the question and cites the expected source. Different family from the Gemini brain → non-circular. | `judge_pass` |
| Faithfulness gate | The 4-gate guard in `backend/faithfulness.py` already ran during `handle_turn`; eval records whether it blocked. | `blocked` |
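
The first two grading layers in the table above can be sketched roughly as follows. This is a minimal illustration, not the code in `run.py`: the `Judge` callable signature, the use of `re.search`, the helper names, and the choice to return `None` when every judge in the chain fails are all assumptions.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical judge signature: (question, expected_answer, reply_text) -> verdict.
# A judge is assumed to raise when its provider is unavailable.
Judge = Callable[[str, str, str], bool]


def regex_pass(expected_regex: str, reply_text: str) -> bool:
    """Hard-facts layer: the expected value must appear somewhere in the reply."""
    return re.search(expected_regex, reply_text) is not None


def judge_with_fallback(
    judges: Sequence[Judge],
    question: str,
    expected_answer: str,
    reply_text: str,
) -> Optional[bool]:
    """Try each judge in order; the first one that answers decides the verdict."""
    for judge in judges:
        try:
            return judge(question, expected_answer, reply_text)
        except Exception:
            # Provider error: fall through to the next judge in the chain.
            continue
    # Assumption: no verdict at all if the whole chain is down.
    return None
```

Keeping the regex layer independent of the judge means a hard fact such as a waiting period is still checked deterministically even when every judge provider is down.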

## Acceptance bar

Per `80-audit/ENTERPRISE_AUDIT.md` D-003: **≥ 90% on the 96-Q gold-QA set is the P0 gate for enterprise deployment.** Current state lives in `results.md`.
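
As a worked example of the bar: 90% of 96 is 86.4, so at least 87 questions must pass. The sketch below does the same arithmetic with integers to avoid floating-point rounding. The function names are hypothetical, and which grading layer counts as a "pass" (regex, judge, or both) is not stated here, so `passed` is simply whatever count `results.md` reports.

```python
def min_passes_needed(total: int = 96, threshold_pct: int = 90) -> int:
    """Smallest pass count that reaches the threshold (integer ceiling division)."""
    return -(-threshold_pct * total // 100)


def meets_bar(passed: int, total: int = 96, threshold_pct: int = 90) -> bool:
    """True when passed / total is at least threshold_pct percent."""
    return passed * 100 >= threshold_pct * total


print(min_passes_needed())  # 87
print(meets_bar(86))        # False
print(meets_bar(87))        # True
```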

## Related

- [`backend/faithfulness.py`](../backend/faithfulness.py) — the 4-gate guard that produces the `blocked` verdict
- [`kb/AUDIT_TRAIL.md`](../kb/AUDIT_TRAIL.md) § "Live bot replies" — provenance trail for any flagged reply
- `tools/info_source_map.py` — generator of `info_source_map.json`
- [ADR-014](../70-docs/60-decisions/ADR-014-groq-llama-grader.md) (superseded, but the cross-family judge principle survives; current judge is NIM Mistral Large 3 675B, brain is Gemini)
- [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md) — current chain shape (Google → NIM → OpenRouter)