eval/: Gold-QA harness
Numerical accuracy + grounding eval. Walks a fixed list of curated questions through the orchestrator in-process (fast, no HTTP) and grades each reply with both a regex hard-facts checker and an LLM-judge of a different family from the brain.
eval/ is the objective quality bar; tools/audit/ is the behavioural stress test. Both feed the readiness register in 80-audit/ENTERPRISE_AUDIT.md.
Files
| File | Role |
|---|---|
| `generate_gold.py` | Builds `gold_qa.json` from `40-data/policy_facts/`: for each curated field with a verbatim quote, emits a natural-language question + expected answer + expected citation. |
| `gold_qa.json` | 96-Q gold set. Each entry: `{policy_id, question, expected_answer, expected_regex, source_quote, source_pdf}`. |
| `run.py` | Runner. For each pair: calls `backend.orchestrator.handle_turn` with `policy_filter_ids=[pair.policy_id]` so retrieval is scoped to the policy under test. Grades each reply twice: regex hard-facts + `JUDGE_CHAIN` (NIM Mistral Large 3 675B primary; a different family from the Gemini brain, so the grading is non-circular). Falls through to NIM Llama-4 Maverick → OpenRouter Qwen 80B :free → NIM Nemotron 49B at the tail. Emits `brain_used` like `sales_brain::continue` / `sales_brain::complete` / `brain_main::<elected_model>` / `brain_main::graceful_error`. |
| `results.json` | Machine-readable last-run results. |
| `results.md` | Human-readable last-run report: per-question pass/fail, per-category rollup, hallucination breakdown. |
| `info_source_map.json` | Generated by `tools/info_source_map.py`. Claim → URL → verdict (✅ 798 / ⚠️ 321 / ❌ 0 / ⏳ 1385 as of 2026-05-14). The canonical source-grounding KPI. |
| `verified_urls.json` | HEAD-check verdict on every URL in the corpus / facts. Generated by `tools/verify_urls.py`. |
| `reviews_url_verification.json` | URL-validation output for `40-data/reviews/<insurer>.json`. |
| `chunk_sweep_results.json`, `chunk_diagnostic.json` | Outputs of `tools/chunk_sweep.py`: chunk-size / overlap grid. See ADR-018. |
Usage
```bash
# Full 96-Q eval
python -m eval.run

# Smoke
python -m eval.run --limit 30

# Scoped to one policy
python -m eval.run --policy hdfc-ergo__optima-secure

# Regenerate the gold set after curation refresh
python -m eval.generate_gold
```
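How `--limit` and `--policy` narrow the run is not spelled out here. One plausible reading is a filter on `policy_id` followed by a truncation. The sketch below assumes that order; `select_pairs` is a hypothetical helper and the entries are made up for illustration.

```python
from typing import Dict, List, Optional


def select_pairs(
    gold: List[Dict[str, str]],
    policy: Optional[str] = None,
    limit: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Filter gold entries by policy_id, then keep at most `limit` of them."""
    pairs = [p for p in gold if policy is None or p["policy_id"] == policy]
    return pairs if limit is None else pairs[:limit]


# Made-up gold entries; only the fields needed for selection are shown.
gold = [
    {"policy_id": "hdfc-ergo__optima-secure", "question": "Q1"},
    {"policy_id": "example-insurer__example-plan", "question": "Q2"},
    {"policy_id": "hdfc-ergo__optima-secure", "question": "Q3"},
]

selected = select_pairs(gold, policy="hdfc-ergo__optima-secure", limit=1)
print([p["question"] for p in selected])
```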
Grading
| Layer | What it checks | Output field |
|---|---|---|
| Regex hard-facts | Expected numeric / boolean / waiting-period value appears verbatim in `reply_text`. | `regex_pass` |
| LLM judge (`JUDGE_CHAIN`: NIM Mistral Large 3 675B primary; NIM Llama-4 Maverick / OR Qwen 80B :free / NIM Nemotron 49B fallbacks per ADR-040) | Reply semantically answers the question and cites the expected source. Different family from the Gemini brain, so the grading is non-circular. | `judge_pass` |
| Faithfulness gate | The 4-gate guard in `backend/faithfulness.py` already ran during `handle_turn`; eval records whether it blocked. | `blocked` |
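The regex layer can be pictured as a single pattern search over the reply. The sketch below is an assumption about its shape, not the checker in `run.py`: `regex_hard_facts` is a hypothetical helper, the case-insensitive flag is a guess, and the pattern is a made-up example of what an `expected_regex` might look like.

```python
import re


def regex_hard_facts(reply_text: str, expected_regex: str) -> bool:
    """True when the expected value pattern occurs somewhere in the reply."""
    return re.search(expected_regex, reply_text, flags=re.IGNORECASE) is not None


# Made-up pattern: a 36-month waiting period, tolerant of "36 months" / "36-month".
pattern = r"\b36[\s-]*months?\b"

exact = regex_hard_facts(
    "The waiting period is 36 months for pre-existing diseases.", pattern
)
paraphrased = regex_hard_facts("The waiting period is 3 years.", pattern)
print(exact, paraphrased)
```

The second reply is factually equivalent but fails the pattern, which illustrates why a verbatim check is paired with a semantic judge rather than used alone.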
Acceptance bar
Per `80-audit/ENTERPRISE_AUDIT.md` D-003: ≥ 90% on the 96-Q gold-QA set is the P0 gate for enterprise deployment. Current state lives in `results.md`.
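As a worked figure, 90% of 96 is 86.4, so at least 87 questions must pass to clear the bar. The sketch below does the arithmetic in integers to avoid floating-point rounding. `min_passes` and `meets_bar` are hypothetical helpers, and which output field counts as a "pass" (`regex_pass`, `judge_pass`, or both) is not stated here, so the pass count is taken as given.

```python
GOLD_SIZE = 96
BAR_PERCENT = 90


def min_passes(total: int, percent: int) -> int:
    """Smallest pass count whose rate is at least `percent` (integer ceiling)."""
    return (total * percent + 99) // 100


def meets_bar(passed: int, total: int, percent: int = BAR_PERCENT) -> bool:
    """True when passed / total >= percent / 100, compared without floats."""
    return passed * 100 >= percent * total


needed = min_passes(GOLD_SIZE, BAR_PERCENT)
print(needed, meets_bar(86, GOLD_SIZE), meets_bar(87, GOLD_SIZE))
```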
Related
- `backend/faithfulness.py`: the 4-gate guard that produces the `blocked` verdict
- `kb/AUDIT_TRAIL.md` § "Live bot replies": provenance trail for any flagged reply
- `tools/info_source_map.py`: generator of `info_source_map.json`
- ADR-014 (superseded, but the cross-family judge principle survives; current judge is NIM Mistral 675B, brain is Gemini)
- ADR-040: current chain shape (Google → NIM → OpenRouter)