agentbench / agent_bench / evaluation

Commit History

calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
e16544c

Jane Yeung Claude Opus 4.7 (1M context) committed on

fix(types): four mypy errors blocking CI
02b8717

Nomearod Claude Opus 4.7 (1M context) committed on

docs(harness,readme): two re-review must-fix items
c39d5c7

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges,calibration,harness): three Codex adversarial-review findings
226b6f4

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges,calibration): five review follow-ups (items 5, 6, 7, 9, 10)
71ec5e8

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges): four review-blocking bugs (review items 1–4 + 8)
9255fb5

Nomearod Claude Opus 4.7 (1M context) committed on

refactor(metrics): delete superseded LLM judges (answer_faithfulness etc.)
281b43d

Nomearod Claude Opus 4.7 (1M context) committed on

refactor(harness): migrate to per-dimension Judge layer (drop faithfulness/correctness)
e76227f

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): generate_kappa_table with strict/warn modes
1d47106

Nomearod Claude Opus 4.7 (1M context) committed on

feat(scripts): run_calibration.py orchestrator for Steps A/C/D
4fa7c61

Nomearod Claude Opus 4.7 (1M context) committed on

feat(goldens): add source_snippets to 8 FastAPI calibration items
a48afb9

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): 30-item stratified calibration_v1 sample
8ef480a

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): hand-rolled cohen_kappa, gwets_ac2, bootstrap_ci
6ef2e0e

Nomearod Claude Opus 4.7 (1M context) committed on

feat(variance): PermutedJudge + Jury - N permutations and multi-judge aggregator
c038a7d

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): CitationFaithfulnessJudge with all-or-nothing aggregation
04d9ea0

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): CompletenessJudge + three-point reference-based rubric
80be2d8

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): RelevanceJudge + anchored three-point rubric
b170eb6

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): GroundednessJudge + anchored binary rubric
30a5e0c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): _call_judge_with_retry helper with strict-reprompt + abstain
ff78845

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): MockJudge with LookupError on missing keys
aa70e89

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): Judge ABC with judge_id derived from model + dimension
2192305

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): Rubric markdown loader with aggressive validation
7b72b2c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): ScoreResult + abstain-reason constants
76e370c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(eval): Week 1 step 5 - 25-question K8s golden dataset + grounded_refusal fix
4454894

Nomearod Claude Opus 4.6 (1M context) committed on

feat: K8s pilot corpus - 8 pages + config entry + JSON rewrite
ce7247c

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add 6-question K8s golden pilot dataset
3484214

Nomearod Claude Opus 4.6 (1M context) committed on

feat: extend GoldenQuestion with source_pages and source_sections
d5884af

Nomearod Claude Opus 4.6 (1M context) committed on

feat: support multi-corpus golden dataset schema
83d6b2b

Nomearod committed on

fix: comparison framing, mock-specific failure analysis, stale test counts
a29d68d

Nomearod Claude Opus 4.6 (1M context) committed on

fix: grounded refusal checks no-sources, reference_answer for judge, mock disclaimer
520796c

Nomearod Claude Opus 4.6 (1M context) committed on

fix: retrieval metrics use ranked sources, LLM judge wired, report complete
3d027cb

Nomearod Claude Opus 4.6 (1M context) committed on

feat: Day 7 - evaluation harness, metrics, report, expanded golden dataset
c378584

Nomearod Claude Opus 4.6 (1M context) committed on

feat: Day 4 - corpus, ingest script, first 10 golden questions
a152b95

Nomearod Claude Opus 4.6 (1M context) committed on