Space: Nomearod/agentbench


agentbench / results

1.67 MB

4 contributors

History: 9 commits

Latest commit: Nomearod, "calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause" (ab0e054, 14 days ago)

| File | Size | Last commit | Updated |
| --- | --- | --- | --- |
| calibration_v1_judge_baseline.json | 133 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| calibration_v1_judge_baseline_no_abstain.json | 131 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| calibration_v1_judge_baseline_no_anchors.json | 124 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| calibration_v1_judge_baseline_no_cot.json | 95 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| calibration_v1_judge_jury_kappa_weighted.json | 41.1 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| calibration_v1_judge_jury_kappa_weighted_members.jsonl | 200 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| calibration_v1_judge_jury_kappa_weighted_v1_1.json | 42.5 kB | calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause | 14 days ago |
| calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl | 207 kB | calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause | 14 days ago |
| calibration_v1_judge_permute.json | 40.7 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| calibration_v1_judge_permute_members.jsonl | 240 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| calibration_v1_system_outputs.json | 230 kB | rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) | 15 days ago |
| comparison_custom_vs_langchain.md | 3.07 kB | fix: comparison framing, mock-specific failure analysis, stale test counts | about 2 months ago |
| fastapi_legacy_baseline_pinned.json | 43.4 kB | docs: Phase 1 gate closure + stale-wording corrections (cross-cutting #3) | about 1 month ago |
| fastapi_postedit.json | 44.1 kB | docs(eval): Fix 1 counterfactual prompt clause — attempted and reverted | about 1 month ago |
| fastapi_preedit.json | 39.8 kB | docs(eval): Fix 1 counterfactual prompt clause — attempted and reverted | about 1 month ago |
| k8s_pilot_threshold_0.015.json | 8.63 kB | feat(eval): K8s first pilot run + flavor-B empirical decisions | about 1 month ago |
| k8s_pilot_threshold_0.02.json | 8.27 kB | feat(eval): K8s first pilot run + flavor-B empirical decisions | about 1 month ago |
| k8s_postedit.json | 8.63 kB | docs(eval): Fix 1 counterfactual prompt clause — attempted and reverted | about 1 month ago |
| k8s_postedit_fix2.json | 8.61 kB | docs(eval): Fix 2 SearchTool query expansion — attempted and reverted | about 1 month ago |
| k8s_postedit_fix2_merge_v2.json | 9.25 kB | docs(eval): Fix 2 SearchTool query expansion — attempted and reverted | about 1 month ago |
| k8s_preedit.json | 8.55 kB | feat(eval): K8s refusal_threshold 0.02 → 0.015 empirical calibration | about 1 month ago |
| k8s_preedit_pinned.json | 8.57 kB | docs(eval): Fix 1 counterfactual prompt clause — attempted and reverted | about 1 month ago |