agentbench / results

Commit History

calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
e16544c
unverified

Jane Yeung Claude Opus 4.7 (1M context) committed on

docs: Phase 1 gate closure + stale-wording corrections (cross-cutting #3)
23de799

Nomearod Claude Opus 4.6 (1M context) committed on

docs(eval): Fix 2 SearchTool query expansion - attempted and reverted
27c2e17

Nomearod Claude Opus 4.6 (1M context) committed on

docs(eval): Fix 1 counterfactual prompt clause - attempted and reverted
213da36

Nomearod Claude Opus 4.6 (1M context) committed on

feat(eval): K8s refusal_threshold 0.02 → 0.015 empirical calibration
125dac0

Nomearod Claude Opus 4.6 (1M context) committed on

feat(eval): K8s first pilot run + flavor-B empirical decisions
2439025

Nomearod Claude Opus 4.6 (1M context) committed on

fix: comparison framing, mock-specific failure analysis, stale test counts
a29d68d

Nomearod Claude Opus 4.6 (1M context) committed on

feat: langchain baseline evaluation results (OpenAI + Anthropic)
81ac43f

Nomearod Claude Opus 4.6 (1M context) committed on