Spaces: Nomearod/agentbench
main · agentbench/docs · 839 kB
4 contributors
History: 24 commits
Nomearod
docs(judge): writeup draft v1 - methodology arc + position + v1.2 fix-list
c093a45
about 2 months ago
_generated
calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
about 2 months ago
plans
docs(plans): judge-layer v1 implementation plan - 12 phases, ~50 tasks
about 2 months ago
benchmark_report.md
5.68 kB
feat: real benchmark numbers from OpenAI gpt-4o-mini evaluation
3 months ago
judge-design.md
34.3 kB
docs(judge): writeup draft v1 - methodology arc + position + v1.2 fix-list
about 2 months ago
k8s-local-setup.md
886 Bytes
feat: infrastructure sprint - vLLM/Modal, Helm, Terraform (#8)
3 months ago
langchain_benchmark_anthropic.md
4.33 kB
fix: comparison framing, mock-specific failure analysis, stale test counts
3 months ago
langchain_benchmark_openai.md
4.31 kB
fix: comparison framing, mock-specific failure analysis, stale test counts
3 months ago
provider_comparison.md
5.87 kB
docs: add known limitations and future work for self-hosted benchmark
3 months ago
retrieval_gate.md
1.23 kB
feat: add reproducible retrieval gate check with committed artifact
3 months ago