Spaces: Nomearod/agentbench
main
agentbench/results
1.67 MB
4 contributors
History:
9 commits
Nomearod
calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
ab0e054
14 days ago
calibration_v1_judge_baseline.json
Safe
133 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
calibration_v1_judge_baseline_no_abstain.json
Safe
131 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
calibration_v1_judge_baseline_no_anchors.json
Safe
124 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
calibration_v1_judge_baseline_no_cot.json
Safe
95 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
calibration_v1_judge_jury_kappa_weighted.json
Safe
41.1 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
calibration_v1_judge_jury_kappa_weighted_members.jsonl
Safe
200 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
calibration_v1_judge_jury_kappa_weighted_v1_1.json
Safe
42.5 kB
calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
14 days ago
calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl
Safe
207 kB
calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
14 days ago
calibration_v1_judge_permute.json
Safe
40.7 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
calibration_v1_judge_permute_members.jsonl
Safe
240 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
calibration_v1_system_outputs.json
Safe
230 kB
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
15 days ago
comparison_custom_vs_langchain.md
Safe
3.07 kB
fix: comparison framing, mock-specific failure analysis, stale test counts
about 2 months ago
fastapi_legacy_baseline_pinned.json
Safe
43.4 kB
docs: Phase 1 gate closure + stale-wording corrections (cross-cutting #3)
about 1 month ago
fastapi_postedit.json
Safe
44.1 kB
docs(eval): Fix 1 counterfactual prompt clause - attempted and reverted
about 1 month ago
fastapi_preedit.json
Safe
39.8 kB
docs(eval): Fix 1 counterfactual prompt clause - attempted and reverted
about 1 month ago
k8s_pilot_threshold_0.015.json
Safe
8.63 kB
feat(eval): K8s first pilot run + flavor-B empirical decisions
about 1 month ago
k8s_pilot_threshold_0.02.json
Safe
8.27 kB
feat(eval): K8s first pilot run + flavor-B empirical decisions
about 1 month ago
k8s_postedit.json
Safe
8.63 kB
docs(eval): Fix 1 counterfactual prompt clause - attempted and reverted
about 1 month ago
k8s_postedit_fix2.json
Safe
8.61 kB
docs(eval): Fix 2 SearchTool query expansion - attempted and reverted
about 1 month ago
k8s_postedit_fix2_merge_v2.json
Safe
9.25 kB
docs(eval): Fix 2 SearchTool query expansion - attempted and reverted
about 1 month ago
k8s_preedit.json
Safe
8.55 kB
feat(eval): K8s refusal_threshold 0.02 → 0.015 empirical calibration
about 1 month ago
k8s_preedit_pinned.json
Safe
8.57 kB
docs(eval): Fix 1 counterfactual prompt clause - attempted and reverted
about 1 month ago