agentbench/tests/test_evaluation.py

Commit History

feat(eval): Week 1 step 5 — 25-question K8s golden dataset + grounded_refusal fix
4454894

Nomearod Claude Opus 4.6 (1M context) committed on

fix: grounded refusal checks no-sources, reference_answer for judge, mock disclaimer
520796c

Nomearod Claude Opus 4.6 (1M context) committed on

fix: retrieval metrics use ranked sources, LLM judge wired, report complete
3d027cb

Nomearod Claude Opus 4.6 (1M context) committed on