agentbench/tests/test_evaluation.py

Commit History

feat(eval): Week 1 step 5 — 25-question K8s golden dataset + grounded_refusal fix

4454894

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

fix: grounded refusal checks no-sources, reference_answer for judge, mock disclaimer

520796c

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24

fix: retrieval metrics use ranked sources, LLM judge wired, report complete

3d027cb

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24