Nomearod / agentbench


agentbench / docs

839 kB

4 contributors

History: 24 commits

docs(judge): writeup draft v1 — methodology arc + position + v1.2 fix-list

c093a45 about 2 months ago

_generated  (about 2 months ago)
    calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause

plans  (about 2 months ago)
    docs(plans): judge-layer v1 implementation plan — 12 phases, ~50 tasks

benchmark_report.md  5.68 kB  (3 months ago)
    feat: real benchmark numbers from OpenAI gpt-4o-mini evaluation

judge-design.md  34.3 kB  (about 2 months ago)
    docs(judge): writeup draft v1 — methodology arc + position + v1.2 fix-list

k8s-local-setup.md  886 Bytes  (3 months ago)
    feat: infrastructure sprint — vLLM/Modal, Helm, Terraform (#8)

langchain_benchmark_anthropic.md  4.33 kB  (3 months ago)
    fix: comparison framing, mock-specific failure analysis, stale test counts

langchain_benchmark_openai.md  4.31 kB  (3 months ago)
    fix: comparison framing, mock-specific failure analysis, stale test counts

provider_comparison.md  5.87 kB  (3 months ago)
    docs: add known limitations and future work for self-hosted benchmark

retrieval_gate.md  1.23 kB  (3 months ago)
    feat: add reproducible retrieval gate check with committed artifact