agentbench / DECISIONS.md

Commit History

calibrate(jury): 4A characterizes v1.1.1 residual as model-class-specific
504a35c

Nomearod Claude Opus 4.7 (1M context) committed on

calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
e16544c

Jane Yeung Claude Opus 4.7 (1M context) committed on

docs+build: judge-layer v1 coupled-artifact updates
508e5ef

Nomearod Claude Opus 4.7 (1M context) committed on

docs(decisions): promote cold-start falsified-assumption and audit-path incident entries, add three-regimes latency refinement
6409a40

Nomearod Claude Opus 4.7 (1M context) committed on

docs(decisions): add entry on named residual risks and scope limits verdict discipline
2e8274b

Nomearod Claude Opus 4.6 (1M context) committed on

docs+test: round-2 incident response - Google API key format scrub
4dc3e01

Nomearod Claude Opus 4.6 (1M context) committed on

docs: incident response + SHA remap after credential-exposure history rewrite
168d3e1

Nomearod Claude Opus 4.6 (1M context) committed on

docs: defer HF Space rename - outstanding applications reference current URL
5d4b3fe

Nomearod Claude Opus 4.6 (1M context) committed on

docs: step 8.1 - tagline reframe + README honest-scope + rename closure
086ad86

Nomearod Claude Opus 4.6 (1M context) committed on

feat(eval): K8s refusal_threshold sweep against 25Q set - 0.015 validated
2d1d822

Nomearod Claude Opus 4.6 (1M context) committed on

docs: step 5 follow-up - parallel-tracks list + post-authoring observations
05bf702

Nomearod Claude Opus 4.6 (1M context) committed on

feat(eval): Week 1 step 5 - 25-question K8s golden dataset + grounded_refusal fix
4454894

Nomearod Claude Opus 4.6 (1M context) committed on

docs: Phase 1 gate closure + stale-wording corrections (cross-cutting #3)
23de799

Nomearod Claude Opus 4.6 (1M context) committed on

docs(eval): Fix 2 SearchTool query expansion β€” attempted and reverted
27c2e17

Nomearod Claude Opus 4.6 (1M context) committed on

docs(eval): Fix 1 counterfactual prompt clause β€” attempted and reverted
213da36

Nomearod Claude Opus 4.6 (1M context) committed on

chore(eval): pin gpt-4o-mini snapshot + wire fastapi golden_dataset + pre-commit tolerances
5c1f49f

Nomearod Claude Opus 4.6 (1M context) committed on

feat(eval): K8s refusal_threshold 0.02 → 0.015 empirical calibration
125dac0

Nomearod Claude Opus 4.6 (1M context) committed on

feat(eval): K8s first pilot run + flavor-B empirical decisions
2439025

Nomearod Claude Opus 4.6 (1M context) committed on

docs: decisions for multi-corpus refactor
361d65d

Nomearod committed on

docs: add decisions for monitor mode, SSE events, vanilla JS
77e1875

Nomearod Claude Opus 4.6 (1M context) committed on

docs: fix test count in Testing section, add auth decision, reorder entries
9a8ca07

Nomearod Claude Opus 4.6 (1M context) committed on

docs: add security architecture section to README and DECISIONS.md
f7bb777

Nomearod Claude Opus 4.6 (1M context) committed on

feat: infrastructure sprint - vLLM/Modal, Helm, Terraform (#8)
a9d4375

Jane Yeung Claude Opus 4.6 (1M context) committed on

fix: remove stale V1 docs, update DECISIONS.md for V2
dc97d8c

Nomearod Claude Opus 4.6 (1M context) committed on

feat: implement Anthropic Claude provider
077b821

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add SQLite conversation sessions with session_id
9874438

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add provider retry with backoff and API rate limiting
871820a

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add cross-encoder reranking with feature flag
65d5480

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add grounded refusal gate based on retrieval score threshold
c410788

Nomearod Claude Opus 4.6 (1M context) committed on

feat: Day 8 - README with architecture, API docs, eval guide + DECISIONS.md
7920a16

Nomearod Claude Opus 4.6 (1M context) committed on