Spaces: Nomearod / agentbench

agentbench / docs

Commit History

docs(judge): writeup draft v1 — methodology arc + position + v1.2 fix-list

c093a45

Nomearod Claude Opus 4.7 (1M context) committed on May 6

calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause

ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on May 6

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)

e16544c
unverified

Jane Yeung Claude Opus 4.7 (1M context) committed on May 5

docs(plans): judge-layer v1 implementation plan — 12 phases, ~50 tasks

171022a

Nomearod Claude Opus 4.7 (1M context) committed on May 4

docs(plans): judge-layer v1 design — supersede continuous-scale judges with discrete-anchored 2-judge jury + κ-validated calibration

44c65d4

Nomearod Claude Opus 4.7 (1M context) committed on May 4

docs(plan): add Part A OWASP mapping implementation plan

ad918a7

Nomearod Claude Opus 4.6 (1M context) committed on Apr 15

docs(plan): Part A design self-review fixes (LLM02 consistency, anti-padding template, paired-review gate)

cc8331d

Nomearod Claude Opus 4.6 (1M context) committed on Apr 15

docs(plan): add Part A OWASP LLM Top 10 (2025) mapping design

7c08d23

Nomearod Claude Opus 4.6 (1M context) committed on Apr 15

docs: multi-corpus refactor implementation plan

a5fc1f3

Nomearod Claude Opus 4.6 (1M context) committed on Apr 12

docs: multi-corpus refactor design

31f0ada

Nomearod Claude Opus 4.6 (1M context) committed on Apr 12

style: fix ruff lint — import sorting, line length

12a17f8

Nomearod Claude Opus 4.6 (1M context) committed on Apr 10

docs: add known limitations and future work for self-hosted benchmark

79e4ae8

Nomearod Claude Opus 4.6 (1M context) committed on Mar 31

docs: deepen self-hosted analysis in provider comparison

04cb97f

Nomearod Claude Opus 4.6 (1M context) committed on Mar 31

feat: infrastructure sprint — vLLM/Modal, Helm, Terraform (#8)

a9d4375

Jane Yeung Claude Opus 4.6 (1M context) committed on Mar 31

fix: comparison framing, mock-specific failure analysis, stale test counts

a29d68d

Nomearod Claude Opus 4.6 (1M context) committed on Mar 27

feat: langchain baseline evaluation results (OpenAI + Anthropic)

81ac43f

Nomearod Claude Opus 4.6 (1M context) committed on Mar 27

fix: remove stale V1 docs, update DECISIONS.md for V2

dc97d8c

Nomearod Claude Opus 4.6 (1M context) committed on Mar 25

docs: add provider comparison report (OpenAI vs Anthropic Haiku)

3e490c9

Nomearod Claude Opus 4.6 (1M context) committed on Mar 25

feat: real benchmark numbers from OpenAI gpt-4o-mini evaluation

3407aff

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24

fix: grounded refusal checks no-sources, reference_answer for judge, mock disclaimer

520796c

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24

fix: retrieval metrics use ranked sources, LLM judge wired, report complete

3d027cb

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24

feat: Day 7 — evaluation harness, metrics, report, expanded golden dataset

c378584

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24

feat: add reproducible retrieval gate check with committed artifact

f0bfb5e

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24

docs: add design document for agent-bench V1

a69ad63

Nomearod Claude Opus 4.6 (1M context) committed on Mar 24