Nomearod/agentbench

Commit History

Merge remote-tracking branch 'origin/main' into hf-deploy

4158bba

Nomearod committed on May 6

dashboard: add #harness + #harness-appendix sections (v3 design integration)

2d9ce3a

Nomearod, Claude Opus 4.7 (1M context) committed on May 6

docs(judge): writeup draft v1 — methodology arc + position + v1.2 fix-list

c093a45

Nomearod, Claude Opus 4.7 (1M context) committed on May 6

calibrate(jury): 4A characterizes v1.1.1 residual as model-class-specific

504a35c

Nomearod, Claude Opus 4.7 (1M context) committed on May 6

calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause

ab0e054

Nomearod, Claude Opus 4.7 (1M context) committed on May 6

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)

e16544c
unverified

Jane Yeung, Claude Opus 4.7 (1M context) committed on May 5

fix(calibration): per-corpus dispatch in generate-outputs (#19)

ee729e0
unverified

Jane Yeung, Claude Opus 4.7 (1M context) committed on May 4

Merge pull request #18 from tyy0811/feat/judge-layer-v1

7ca889f
unverified

Jane Yeung committed on May 4

fix(types): four mypy errors blocking CI

02b8717

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

docs(harness,readme): two re-review must-fix items

c39d5c7

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

fix(judges,calibration,harness): three Codex adversarial-review findings

226b6f4

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

fix(judges,calibration): five review follow-ups (items 5, 6, 7, 9, 10)

71ec5e8

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

fix(judges): four review-blocking bugs (review items 1–4 + 8)

9255fb5

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

docs+build: judge-layer v1 coupled-artifact updates

508e5ef

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

refactor(metrics): delete superseded LLM judges (answer_faithfulness etc.)

281b43d

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

refactor(harness): migrate to per-dimension Judge layer (drop faithfulness/correctness)

e76227f

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(config): add evaluation.judge_dimensions field

12cb8b7

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(calibration): generate_kappa_table with strict/warn modes

1d47106

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(scripts): run_calibration.py orchestrator for Steps A/C/D

4fa7c61

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(calibration): six row configs for the κ ablation table

cf57f16

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(goldens): add source_snippets to 8 FastAPI calibration items

a48afb9

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(calibration): 30-item stratified calibration_v1 sample

8ef480a

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

test(calibration): sklearn-parity fixtures + cross-check CI test

3a2ed35

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(calibration): hand-rolled cohen_kappa, gwets_ac2, bootstrap_ci

6ef2e0e

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(variance): PermutedJudge + Jury — N permutations and multi-judge aggregator

c038a7d

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): CitationFaithfulnessJudge with all-or-nothing aggregation

04d9ea0

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): CompletenessJudge + three-point reference-based rubric

80be2d8

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): RelevanceJudge + anchored three-point rubric

b170eb6

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): GroundednessJudge + anchored binary rubric

30a5e0c

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): _call_judge_with_retry helper with strict-reprompt + abstain

ff78845

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): MockJudge with LookupError on missing keys

aa70e89

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): Judge ABC with judge_id derived from model + dimension

2192305

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): Rubric markdown loader with aggressive validation

7b72b2c

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

feat(judges): ScoreResult + abstain-reason constants

76e370c

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

test: scaffold tests/evaluation/ directory for judge-layer tests

f94cea7

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

ci: document zero-secret contract on test job with empty env block

86ddcb7

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

chore(tooling): exclude scripts/_dev/ from ruff and mypy

d15fbd3

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

docs(plans): judge-layer v1 implementation plan — 12 phases, ~50 tasks

171022a

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

docs(plans): judge-layer v1 design — supersede continuous-scale judges with discrete-anchored 2-judge jury + κ-validated calibration

44c65d4

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

docs(readme): correct test count 444 → 443

0e96cb9

Nomearod, Claude Opus 4.7 (1M context) committed on May 4

Merge remote-tracking branch 'origin/main' into hf-deploy

4161c3e

Nomearod committed on Apr 30

Merge pull request #17 from tyy0811/dashboard-v3

fcfd067
unverified

Jane Yeung committed on Apr 30

Redesign landing page: paper+ink visual system, instrumented pipeline, OWASP badges

82b6725

Nomearod, Claude Opus 4.7 (1M context) committed on Apr 29

Merge remote-tracking branch 'origin/main' into hf-deploy

efffb61

Nomearod committed on Apr 23

Merge pull request #16 from tyy0811/feat/chip-row-owasp-coverage-subtitle

a9409b2
unverified

Jane Yeung committed on Apr 23

feat(landing): OWASP coverage subtitle + LLM05 tooltip on corpus chips

414c372

Nomearod, Claude Opus 4.7 (1M context) committed on Apr 23

Merge remote-tracking branch 'origin/main' into hf-deploy

63d835d

Nomearod committed on Apr 23

Merge pull request #15 from tyy0811/docs/security-llm07-residual-risk

ddda523
unverified

Jane Yeung committed on Apr 23

docs(security): LLM07 named residual risk — injection classifier coverage gap

13317a0

Nomearod, Claude Opus 4.7 (1M context) committed on Apr 23

Merge remote-tracking branch 'origin/main' into hf-deploy

c750a10

Nomearod committed on Apr 23