Spaces: Nomearod/agentbench
Branch: main
Path: agentbench/measurements (123 kB)
4 contributors
History: 5 commits
Latest commit: Nomearod, "calibrate(jury): 4A characterizes v1.1.1 residual as model-class-specific" (504a35c, 25 days ago)
File | Size | Last commit | Updated
2026-04-15-coldstart-n1.log | 2.16 kB | docs: cold-start measurement + falsified-assumption finding + v1.1 contingency | about 2 months ago
2026-04-15-coldstart-n2.log | 2.14 kB | docs: cold-start measurement + falsified-assumption finding + v1.1 contingency | about 2 months ago
2026-04-15-coldstart-n3.log | 2.17 kB | docs: cold-start measurement + falsified-assumption finding + v1.1 contingency | about 2 months ago
2026-05-04-judge-calibration-labels.jsonl | 26.4 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 26 days ago
2026-05-05-judge-rubric-opus-stress.jsonl | 72.4 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 26 days ago
2026-05-06-3a-paraphrase-recency-probe.jsonl | 4.11 kB | calibrate(jury): v1.1+v1.1.1 β fix weighting bugs; recency-position paraphrase clause | 25 days ago
2026-05-06-4a-gpt4o-full-probe.jsonl | 5.1 kB | calibrate(jury): 4A characterizes v1.1.1 residual as model-class-specific | 25 days ago
2026-05-06-gpt4o-extraction-reasoning-split.md | 7.66 kB | calibrate(jury): v1.1+v1.1.1 β fix weighting bugs; recency-position paraphrase clause | 25 days ago
README.md | 1.05 kB | docs+build: judge-layer v1 coupled-artifact updates | 27 days ago