agentbench/scripts/benchmark.py

Commit History

feat: Day 7 — evaluation harness, metrics, report, expanded golden dataset
c378584

Nomearod Claude Opus 4.6 (1M context) committed on