agentbench / results

Commit History

calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause

ab0e054

Nomearod Claude Opus 4.7 (1M context) committed 27 days ago

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)

e16544c
unverified

Jane Yeung Claude Opus 4.7 (1M context) committed 28 days ago

docs: Phase 1 gate closure + stale-wording corrections (cross-cutting #3)

23de799

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

docs(eval): Fix 2 SearchTool query expansion — attempted and reverted

27c2e17

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

docs(eval): Fix 1 counterfactual prompt clause — attempted and reverted

213da36

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

feat(eval): K8s refusal_threshold 0.02 → 0.015 empirical calibration

125dac0

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

feat(eval): K8s first pilot run + flavor-B empirical decisions

2439025

Nomearod Claude Opus 4.6 (1M context) committed on Apr 13

fix: comparison framing, mock-specific failure analysis, stale test counts

a29d68d

Nomearod Claude Opus 4.6 (1M context) committed on Mar 27

feat: langchain baseline evaluation results (OpenAI + Anthropic)

81ac43f

Nomearod Claude Opus 4.6 (1M context) committed on Mar 27