feat: real benchmark numbers from OpenAI gpt-4o-mini evaluation 3407aff Nomearod Claude Opus 4.6 (1M context) committed on Mar 24
fix: grounded refusal checks no-sources, reference_answer for judge, mock disclaimer 520796c Nomearod Claude Opus 4.6 (1M context) committed on Mar 24
fix: retrieval metrics use ranked sources, LLM judge wired, report complete 3d027cb Nomearod Claude Opus 4.6 (1M context) committed on Mar 24
feat: Day 7 — evaluation harness, metrics, report, expanded golden dataset c378584 Nomearod Claude Opus 4.6 (1M context) committed on Mar 24