Spaces: Nomearod/agentbench
main
agentbench/docs
839 kB
4 contributors
History: 24 commits
Nomearod
docs(judge): writeup draft v1 - methodology arc + position + v1.2 fix-list
c093a45
1 day ago
_generated
calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
1 day ago
plans
docs(plans): judge-layer v1 implementation plan - 12 phases, ~50 tasks
3 days ago
benchmark_report.md
Safe
5.68 kB
feat: real benchmark numbers from OpenAI gpt-4o-mini evaluation
about 1 month ago
judge-design.md
34.3 kB
docs(judge): writeup draft v1 - methodology arc + position + v1.2 fix-list
1 day ago
k8s-local-setup.md
886 Bytes
feat: infrastructure sprint - vLLM/Modal, Helm, Terraform (#8)
about 1 month ago
langchain_benchmark_anthropic.md
4.33 kB
fix: comparison framing, mock-specific failure analysis, stale test counts
about 1 month ago
langchain_benchmark_openai.md
4.31 kB
fix: comparison framing, mock-specific failure analysis, stale test counts
about 1 month ago
provider_comparison.md
5.87 kB
docs: add known limitations and future work for self-hosted benchmark
about 1 month ago
retrieval_gate.md
Safe
1.23 kB
feat: add reproducible retrieval gate check with committed artifact
about 1 month ago