agentbench / DECISIONS.md

Commit History

calibrate(jury): 4A characterizes v1.1.1 residual as model-class-specific

504a35c

Nomearod Claude Opus 4.7 (1M context) committed 21 days ago

calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause

ab0e054

Nomearod Claude Opus 4.7 (1M context) committed 21 days ago

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)

e16544c
unverified

Jane Yeung Claude Opus 4.7 (1M context) committed 22 days ago

docs+build: judge-layer v1 coupled-artifact updates

508e5ef

Nomearod Claude Opus 4.7 (1M context) committed 23 days ago

docs(decisions): promote cold-start falsified-assumption and audit-path incident entries, add three-regimes latency refinement

6409a40

Nomearod Claude Opus 4.7 (1M context) committed on Apr 23

docs(decisions): add entry on named residual risks and scope limits verdict discipline

2e8274b

Nomearod Claude Opus 4.6 (1M context) committed on Apr 15

docs+test: round-2 incident response — Google API key format scrub

4dc3e01

Nomearod Claude Opus 4.6 (1M context) committed on Apr 15

docs: incident response + SHA remap after credential-exposure history rewrite

168d3e1

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

docs: defer HF Space rename — outstanding applications reference current URL

5d4b3fe

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

docs: step 8.1 — tagline reframe + README honest-scope + rename closure

086ad86

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

feat(eval): K8s refusal_threshold sweep against 25Q set — 0.015 validated

2d1d822

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

docs: step 5 follow-up — parallel-tracks list + post-authoring observations

05bf702

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

feat(eval): Week 1 step 5 — 25-question K8s golden dataset + grounded_refusal fix

4454894

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

docs: Phase 1 gate closure + stale-wording corrections (cross-cutting #3)

23de799

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

docs(eval): Fix 2 SearchTool query expansion — attempted and reverted

27c2e17

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

docs(eval): Fix 1 counterfactual prompt clause — attempted and reverted

213da36

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

chore(eval): pin gpt-4o-mini snapshot + wire fastapi golden_dataset + pre-commit tolerances

5c1f49f

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

feat(eval): K8s refusal_threshold 0.02 → 0.015 empirical calibration

125dac0

Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

feat(eval): K8s first pilot run + flavor-B empirical decisions

2439025

Nomearod Claude Opus 4.6 (1M context) committed on Apr 13

docs: decisions for multi-corpus refactor

361d65d

Nomearod committed on Apr 12

docs: add decisions for monitor mode, SSE events, vanilla JS

77e1875

Nomearod Claude Opus 4.6 (1M context) committed on Apr 10

docs: fix test count in Testing section, add auth decision, reorder entries

9a8ca07

Nomearod Claude Opus 4.6 (1M context) committed on Mar 31

docs: add security architecture section to README and DECISIONS.md

f7bb777

Nomearod Claude Opus 4.6 (1M context) committed on Mar 31

feat: infrastructure sprint — vLLM/Modal, Helm, Terraform (#8)

a9d4375

Jane Yeung Claude Opus 4.6 (1M context) committed on Mar 31

fix: remove stale V1 docs, update DECISIONS.md for V2

dc97d8c

Nomearod Claude Opus 4.6 (1M context) committed on Mar 25

feat: implement Anthropic Claude provider

077b821

Nomearod Claude Opus 4.6 (1M context) committed on Mar 25

feat: add SQLite conversation sessions with session_id

9874438

Nomearod Claude Opus 4.6 (1M context) committed on Mar 25

feat: add provider retry with backoff and API rate limiting

871820a

Nomearod Claude Opus 4.6 (1M context) committed on Mar 25

feat: add cross-encoder reranking with feature flag

65d5480

Nomearod Claude Opus 4.6 (1M context) committed on Mar 25

feat: add grounded refusal gate based on retrieval score threshold

c410788

Nomearod Claude Opus 4.6 (1M context) committed on Mar 25

feat: Day 8 — README with architecture, API docs, eval guide + DECISIONS.md

7920a16

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24