agentbench / tests / evaluation

Commit History

calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause
ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges,calibration,harness): three Codex adversarial-review findings
226b6f4

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges,calibration): five review follow-ups (items 5, 6, 7, 9, 10)
71ec5e8

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges): four review-blocking bugs (review items 1–4 + 8)
9255fb5

Nomearod Claude Opus 4.7 (1M context) committed on

feat(config): add evaluation.judge_dimensions field
12cb8b7

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): generate_kappa_table with strict/warn modes
1d47106

Nomearod Claude Opus 4.7 (1M context) committed on

feat(scripts): run_calibration.py orchestrator for Steps A/C/D
4fa7c61

Nomearod Claude Opus 4.7 (1M context) committed on

test(calibration): sklearn-parity fixtures + cross-check CI test
3a2ed35

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): hand-rolled cohen_kappa, gwets_ac2, bootstrap_ci
6ef2e0e

Nomearod Claude Opus 4.7 (1M context) committed on

feat(variance): PermutedJudge + Jury — N permutations and multi-judge aggregator
c038a7d

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): CitationFaithfulnessJudge with all-or-nothing aggregation
04d9ea0

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): CompletenessJudge + three-point reference-based rubric
80be2d8

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): RelevanceJudge + anchored three-point rubric
b170eb6

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): GroundednessJudge + anchored binary rubric
30a5e0c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): _call_judge_with_retry helper with strict-reprompt + abstain
ff78845

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): MockJudge with LookupError on missing keys
aa70e89

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): Judge ABC with judge_id derived from model + dimension
2192305

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): Rubric markdown loader with aggressive validation
7b72b2c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): ScoreResult + abstain-reason constants
76e370c

Nomearod Claude Opus 4.7 (1M context) committed on

test: scaffold tests/evaluation/ directory for judge-layer tests
f94cea7

Nomearod Claude Opus 4.7 (1M context) committed on