fix(types): four mypy errors blocking CI 02b8717 Nomearod Claude Opus 4.7 (1M context) committed on 29 days ago
refactor(harness): migrate to per-dimension Judge layer (drop faithfulness/correctness) e76227f Nomearod Claude Opus 4.7 (1M context) committed on 29 days ago
fix: comparison framing, mock-specific failure analysis, stale test counts a29d68d Nomearod Claude Opus 4.6 (1M context) committed on Mar 27
fix: grounded refusal checks no-sources, reference_answer for judge, mock disclaimer 520796c Nomearod Claude Opus 4.6 (1M context) committed on Mar 24
fix: retrieval metrics use ranked sources, LLM judge wired, report complete 3d027cb Nomearod Claude Opus 4.6 (1M context) committed on Mar 24
feat: Day 7 — evaluation harness, metrics, report, expanded golden dataset c378584 Nomearod Claude Opus 4.6 (1M context) committed on Mar 24