agentbench / docs

Commit History

docs(judge): writeup draft v1 - methodology arc + position + v1.2 fix-list
c093a45

Nomearod Claude Opus 4.7 (1M context) committed on

calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
e16544c
unverified

Jane Yeung Claude Opus 4.7 (1M context) committed on

docs(plans): judge-layer v1 implementation plan - 12 phases, ~50 tasks
171022a

Nomearod Claude Opus 4.7 (1M context) committed on

docs(plans): judge-layer v1 design - supersede continuous-scale judges with discrete-anchored 2-judge jury + κ-validated calibration
44c65d4

Nomearod Claude Opus 4.7 (1M context) committed on

docs(plan): add Part A OWASP mapping implementation plan
ad918a7

Nomearod Claude Opus 4.6 (1M context) committed on

docs(plan): Part A design self-review fixes (LLM02 consistency, anti-padding template, paired-review gate)
cc8331d

Nomearod Claude Opus 4.6 (1M context) committed on

docs(plan): add Part A OWASP LLM Top 10 (2025) mapping design
7c08d23

Nomearod Claude Opus 4.6 (1M context) committed on

docs: multi-corpus refactor implementation plan
a5fc1f3

Nomearod Claude Opus 4.6 (1M context) committed on

docs: multi-corpus refactor design
31f0ada

Nomearod Claude Opus 4.6 (1M context) committed on

style: fix ruff lint - import sorting, line length
12a17f8

Nomearod Claude Opus 4.6 (1M context) committed on

docs: add known limitations and future work for self-hosted benchmark
79e4ae8

Nomearod Claude Opus 4.6 (1M context) committed on

docs: deepen self-hosted analysis in provider comparison
04cb97f

Nomearod Claude Opus 4.6 (1M context) committed on

feat: infrastructure sprint - vLLM/Modal, Helm, Terraform (#8)
a9d4375

Jane Yeung Claude Opus 4.6 (1M context) committed on

fix: comparison framing, mock-specific failure analysis, stale test counts
a29d68d

Nomearod Claude Opus 4.6 (1M context) committed on

feat: langchain baseline evaluation results (OpenAI + Anthropic)
81ac43f

Nomearod Claude Opus 4.6 (1M context) committed on

fix: remove stale V1 docs, update DECISIONS.md for V2
dc97d8c

Nomearod Claude Opus 4.6 (1M context) committed on

docs: add provider comparison report (OpenAI vs Anthropic Haiku)
3e490c9

Nomearod Claude Opus 4.6 (1M context) committed on

feat: real benchmark numbers from OpenAI gpt-4o-mini evaluation
3407aff

Nomearod Claude Opus 4.6 (1M context) committed on

fix: grounded refusal checks no-sources, reference_answer for judge, mock disclaimer
520796c

Nomearod Claude Opus 4.6 (1M context) committed on

fix: retrieval metrics use ranked sources, LLM judge wired, report complete
3d027cb

Nomearod Claude Opus 4.6 (1M context) committed on

feat: Day 7 - evaluation harness, metrics, report, expanded golden dataset
c378584

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add reproducible retrieval gate check with committed artifact
f0bfb5e

Nomearod Claude Opus 4.6 (1M context) committed on

Add design document for agent-bench V1
a69ad63

Nomearod Claude Opus 4.6 (1M context) committed on