agentbench / agent_bench

Commit History

dashboard: add #harness + #harness-appendix sections (v3 design integration)
2d9ce3a

Nomearod Claude Opus 4.7 (1M context) committed on

calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause
ab0e054

Nomearod Claude Opus 4.7 (1M context) committed on

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)
e16544c

Jane Yeung Claude Opus 4.7 (1M context) committed on

fix(types): four mypy errors blocking CI
02b8717

Nomearod Claude Opus 4.7 (1M context) committed on

docs(harness,readme): two re-review must-fix items
c39d5c7

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges,calibration,harness): three Codex adversarial-review findings
226b6f4

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges,calibration): five review follow-ups (items 5, 6, 7, 9, 10)
71ec5e8

Nomearod Claude Opus 4.7 (1M context) committed on

fix(judges): four review-blocking bugs (review items 1–4 + 8)
9255fb5

Nomearod Claude Opus 4.7 (1M context) committed on

refactor(metrics): delete superseded LLM judges (answer_faithfulness etc.)
281b43d

Nomearod Claude Opus 4.7 (1M context) committed on

refactor(harness): migrate to per-dimension Judge layer (drop faithfulness/correctness)
e76227f

Nomearod Claude Opus 4.7 (1M context) committed on

feat(config): add evaluation.judge_dimensions field
12cb8b7

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): generate_kappa_table with strict/warn modes
1d47106

Nomearod Claude Opus 4.7 (1M context) committed on

feat(scripts): run_calibration.py orchestrator for Steps A/C/D
4fa7c61

Nomearod Claude Opus 4.7 (1M context) committed on

feat(goldens): add source_snippets to 8 FastAPI calibration items
a48afb9

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): 30-item stratified calibration_v1 sample
8ef480a

Nomearod Claude Opus 4.7 (1M context) committed on

feat(calibration): hand-rolled cohen_kappa, gwets_ac2, bootstrap_ci
6ef2e0e

Nomearod Claude Opus 4.7 (1M context) committed on

feat(variance): PermutedJudge + Jury — N permutations and multi-judge aggregator
c038a7d

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): CitationFaithfulnessJudge with all-or-nothing aggregation
04d9ea0

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): CompletenessJudge + three-point reference-based rubric
80be2d8

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): RelevanceJudge + anchored three-point rubric
b170eb6

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): GroundednessJudge + anchored binary rubric
30a5e0c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): _call_judge_with_retry helper with strict-reprompt + abstain
ff78845

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): MockJudge with LookupError on missing keys
aa70e89

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): Judge ABC with judge_id derived from model + dimension
2192305

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): Rubric markdown loader with aggressive validation
7b72b2c

Nomearod Claude Opus 4.7 (1M context) committed on

feat(judges): ScoreResult + abstain-reason constants
76e370c

Nomearod Claude Opus 4.7 (1M context) committed on

Redesign landing page: paper+ink visual system, instrumented pipeline, OWASP badges
82b6725

Nomearod Claude Opus 4.7 (1M context) committed on

feat(landing): OWASP coverage subtitle + LLM05 tooltip on corpus chips
414c372

Nomearod Claude Opus 4.7 (1M context) committed on

fix(landing): align OWASP chip tooltips with actual defense paths
a83a8f8

Nomearod Claude Opus 4.7 (1M context) committed on

feat(landing): add OWASP security-demo chips to landing page
37c97c8

Nomearod Claude Opus 4.7 (1M context) committed on

docs(landing): add OWASP mapping subtitle to Security panel
b6374bd

Nomearod Claude Opus 4.6 (1M context) committed on

fix(audit): catch all write errors so audit failures can't crash requests
25e0f1b

Nomearod Claude Opus 4.6 (1M context) committed on

docs: step 8.1 — tagline reframe + README honest-scope + rename closure
086ad86

Nomearod Claude Opus 4.6 (1M context) committed on

feat(eval): Week 1 step 5 — 25-question K8s golden dataset + grounded_refusal fix
4454894

Nomearod Claude Opus 4.6 (1M context) committed on

feat(serving): Week 1 step 7 — showcase UI launch polish
8373c87

Nomearod committed on

chore(eval): pin gpt-4o-mini snapshot + wire fastapi golden_dataset + pre-commit tolerances
5c1f49f

Nomearod Claude Opus 4.6 (1M context) committed on

feat: K8s pilot corpus — 8 pages + config entry + JSON rewrite
ce7247c

Nomearod Claude Opus 4.6 (1M context) committed on

feat: evaluate.py --corpus flag + CorpusConfig.golden_dataset
68d96ea

Nomearod Claude Opus 4.6 (1M context) committed on

feat: add 6-question K8s golden pilot dataset
3484214

Nomearod Claude Opus 4.6 (1M context) committed on

feat: extend GoldenQuestion with source_pages and source_sections
d5884af

Nomearod Claude Opus 4.6 (1M context) committed on

fix: batch-3 adversarial review findings
42c7303

Nomearod committed on

feat: K8s corpus config entry, ingestion target, curation policy
3c0089e

Nomearod committed on

feat: dashboard corpus selector, chip swap, chat tags
6e8d2ee

Nomearod committed on

refactor: address batch-2 review feedback
1bf7f2d

Nomearod committed on

feat: parameterized system prompt template
f56d519

Nomearod committed on

feat: SSE meta event carries corpus + corpus_label
116d6ee

Nomearod committed on

feat: per-request corpus routing via Literal validation
6456833

Nomearod committed on

security: fail-closed on secret extraction and env var leakage
6ca375c

Nomearod committed on

refactor: address batch-1 review feedback
f717b74

Nomearod committed on

feat: support multi-corpus golden dataset schema
83d6b2b

Nomearod committed on