agentbench / scripts

Commit History

calibrate(jury): 4A characterizes v1.1.1 residual as model-class-specific
504a35c

Nomearod Claude Opus 4.7 (1M context) committed on

calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause
ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
e16544c
unverified

Jane Yeung Claude Opus 4.7 (1M context) committed on

fix(calibration): per-corpus dispatch in generate-outputs (#19)
ee729e0
unverified

Jane Yeung Claude Opus 4.7 (1M context) committed on

fix(judges,calibration,harness): three Codex adversarial-review findings
226b6f4

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges,calibration): five review follow-ups (items 5, 6, 7, 9, 10)
71ec5e8

Nomearod Claude Opus 4.7 (1M context) committed on

feat(scripts): run_calibration.py orchestrator for Steps A/C/D
4fa7c61

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): 30-item stratified calibration_v1 sample
8ef480a

Nomearod Claude Opus 4.7 (1M context) committed on

test(calibration): sklearn-parity fixtures + cross-check CI test
3a2ed35

Nomearod Claude Opus 4.7 (1M context) committed on

fix(ingest): exclude QUESTION_PLAN.md from corpus ingestion
9dfd3f0

Nomearod Claude Opus 4.6 (1M context) committed on

feat: evaluate.py --corpus flag + CorpusConfig.golden_dataset
68d96ea

Nomearod Claude Opus 4.6 (1M context) committed on

fix: deferred imports, match iteration budget, token cost tracking
2c64504

Nomearod Claude Opus 4.6 (1M context) committed on

feat: langchain evaluation CLI script and Makefile target
9f98da1

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add cross-encoder reranking with feature flag
65d5480

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add grounded refusal gate based on retrieval score threshold
c410788

Nomearod Claude Opus 4.6 (1M context) committed on

fix: retrieval metrics use ranked sources, LLM judge wired, report complete
3d027cb

Nomearod Claude Opus 4.6 (1M context) committed on

feat: Day 7 - evaluation harness, metrics, report, expanded golden dataset
c378584

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add reproducible retrieval gate check with committed artifact
f0bfb5e

Nomearod Claude Opus 4.6 (1M context) committed on

feat: Day 4 - corpus, ingest script, first 10 golden questions
a152b95

Nomearod Claude Opus 4.6 (1M context) committed on