agentbench / tests

Commit History

calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause
ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on

fix(calibration): per-corpus dispatch in generate-outputs (#19)
ee729e0

Jane Yeung Claude Opus 4.7 (1M context) committed on

fix(judges,calibration,harness): three Codex adversarial-review findings
226b6f4

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges,calibration): five review follow-ups (items 5, 6, 7, 9, 10)
71ec5e8

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges): four review-blocking bugs (review items 1–4 + 8)
9255fb5

Nomearod Claude Opus 4.7 (1M context) committed on

refactor(harness): migrate to per-dimension Judge layer (drop faithfulness/correctness)
e76227f

Nomearod Claude Opus 4.7 (1M context) committed on

feat(config): add evaluation.judge_dimensions field
12cb8b7

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): generate_kappa_table with strict/warn modes
1d47106

Nomearod Claude Opus 4.7 (1M context) committed on

feat(scripts): run_calibration.py orchestrator for Steps A/C/D
4fa7c61

Nomearod Claude Opus 4.7 (1M context) committed on

test(calibration): sklearn-parity fixtures + cross-check CI test
3a2ed35

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): hand-rolled cohen_kappa, gwets_ac2, bootstrap_ci
6ef2e0e

Nomearod Claude Opus 4.7 (1M context) committed on

feat(variance): PermutedJudge + Jury — N permutations and multi-judge aggregator
c038a7d

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): CitationFaithfulnessJudge with all-or-nothing aggregation
04d9ea0

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): CompletenessJudge + three-point reference-based rubric
80be2d8

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): RelevanceJudge + anchored three-point rubric
b170eb6

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): GroundednessJudge + anchored binary rubric
30a5e0c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): _call_judge_with_retry helper with strict-reprompt + abstain
ff78845

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): MockJudge with LookupError on missing keys
aa70e89

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): Judge ABC with judge_id derived from model + dimension
2192305

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): Rubric markdown loader with aggressive validation
7b72b2c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): ScoreResult + abstain-reason constants
76e370c

Nomearod Claude Opus 4.7 (1M context) committed on

test: scaffold tests/evaluation/ directory for judge-layer tests
f94cea7

Nomearod Claude Opus 4.7 (1M context) committed on

docs+test: round-2 incident response — Google API key format scrub
4dc3e01

Nomearod Claude Opus 4.6 (1M context) committed on

feat(eval): Week 1 step 5 — 25-question K8s golden dataset + grounded_refusal fix
4454894

Nomearod Claude Opus 4.6 (1M context) committed on

docs(eval): Fix 2 SearchTool query expansion — attempted and reverted
27c2e17

Nomearod Claude Opus 4.6 (1M context) committed on

feat: extend GoldenQuestion with source_pages and source_sections
d5884af

Nomearod Claude Opus 4.6 (1M context) committed on

fix: batch-3 adversarial review findings
42c7303

Nomearod committed on

refactor: address batch-2 review feedback
1bf7f2d

Nomearod committed on

feat: parameterized system prompt template
f56d519

Nomearod committed on

feat: SSE meta event carries corpus + corpus_label
116d6ee

Nomearod committed on

feat: per-request corpus routing via Literal validation
6456833

Nomearod committed on

security: fail-closed on secret extraction and env var leakage
6ca375c

Nomearod committed on

refactor: address batch-1 review feedback
f717b74

Nomearod committed on

feat: support multi-corpus golden dataset schema
83d6b2b

Nomearod committed on

feat: multi-corpus construction with nested corpus×provider composition
4ec6632

Nomearod committed on

feat: add CorpusConfig for multi-corpus support
4fb5dcb

Nomearod committed on

fix: stream stage events live, thread source_chunks, fix LangChain wrapper
77c4ed4

Nomearod Claude Opus 4.6 (1M context) committed on

style: fix ruff lint — import sorting, line length
12a17f8

Nomearod Claude Opus 4.6 (1M context) committed on

test: update security integration mock for _orchestrator_done event
148a231

Nomearod Claude Opus 4.6 (1M context) committed on

feat: route handler emits meta, injection, output_validation SSE events
6a07150

Nomearod Claude Opus 4.6 (1M context) committed on

fix: address remaining review issues (3/5)
ffea8ea

Nomearod Claude Opus 4.6 (1M context) committed on

feat: orchestrator.run_stream emits per-stage SSE events
d0f0142

Nomearod Claude Opus 4.6 (1M context) committed on

feat: enrich SearchTool metadata with scores, previews, PII count
1d16fd9

Nomearod Claude Opus 4.6 (1M context) committed on

feat: expose reranker scores through retrieval pipeline
c5573d3

Nomearod Claude Opus 4.6 (1M context) committed on

fix: ruff lint — import sorting, unused imports, line length, naming
ecb7080

Nomearod Claude Opus 4.6 (1M context) committed on

fix(security): buffer stream for output validation, store filtered answer
06bc29e

Nomearod Claude Opus 4.6 (1M context) committed on

fix(security): output validation on /ask/stream, correct audit endpoint
02f7f66

Nomearod Claude Opus 4.6 (1M context) committed on

fix(security): add security to /ask/stream, wire PII redactor into SearchTool
14985f8

Nomearod Claude Opus 4.6 (1M context) committed on

feat(security): wire injection detection, output validation, audit into pipeline
cebf463

Nomearod Claude Opus 4.6 (1M context) committed on

fix(security): strip punctuation before slashes in URL normalization
7d3f664

Nomearod Claude Opus 4.6 (1M context) committed on