Commit History
dashboard: add #harness + #harness-appendix sections (v3 design integration) 2d9ce3a
docs(judge): writeup draft v1 - methodology arc + position + v1.2 fix-list c093a45
calibrate(jury): 4A characterizes v1.1.1 residual as model-class-specific 504a35c
calibrate(jury): v1.1+v1.1.1 - fix weighting bugs; recency-position paraphrase clause ab0e054
rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20) e16544c unverified
Jane Yeung Claude Opus 4.7 (1M context) committed on
fix(calibration): per-corpus dispatch in generate-outputs (#19) ee729e0 unverified
Jane Yeung Claude Opus 4.7 (1M context) committed on
Merge pull request #18 from tyy0811/feat/judge-layer-v1 7ca889f unverified
Jane Yeung committed on
fix(types): four mypy errors blocking CI 02b8717
docs(harness,readme): two re-review must-fix items c39d5c7
fix(judges,calibration,harness): three Codex adversarial-review findings 226b6f4
fix(judges,calibration): five review follow-ups (items 5, 6, 7, 9, 10) 71ec5e8
fix(judges): four review-blocking bugs (review items 1–4 + 8) 9255fb5
docs+build: judge-layer v1 coupled-artifact updates 508e5ef
refactor(metrics): delete superseded LLM judges (answer_faithfulness etc.) 281b43d
refactor(harness): migrate to per-dimension Judge layer (drop faithfulness/correctness) e76227f
feat(config): add evaluation.judge_dimensions field 12cb8b7
feat(calibration): generate_kappa_table with strict/warn modes 1d47106
feat(scripts): run_calibration.py orchestrator for Steps A/C/D 4fa7c61
feat(calibration): six row configs for the κ ablation table cf57f16
feat(goldens): add source_snippets to 8 FastAPI calibration items a48afb9
feat(calibration): 30-item stratified calibration_v1 sample 8ef480a
test(calibration): sklearn-parity fixtures + cross-check CI test 3a2ed35
feat(calibration): hand-rolled cohen_kappa, gwets_ac2, bootstrap_ci 6ef2e0e
feat(variance): PermutedJudge + Jury - N permutations and multi-judge aggregator c038a7d
feat(judges): CitationFaithfulnessJudge with all-or-nothing aggregation 04d9ea0
feat(judges): CompletenessJudge + three-point reference-based rubric 80be2d8
feat(judges): RelevanceJudge + anchored three-point rubric b170eb6
feat(judges): GroundednessJudge + anchored binary rubric 30a5e0c
feat(judges): _call_judge_with_retry helper with strict-reprompt + abstain ff78845
feat(judges): MockJudge with LookupError on missing keys aa70e89
feat(judges): Judge ABC with judge_id derived from model + dimension 2192305
feat(judges): Rubric markdown loader with aggressive validation 7b72b2c
feat(judges): ScoreResult + abstain-reason constants 76e370c
test: scaffold tests/evaluation/ directory for judge-layer tests f94cea7
ci: document zero-secret contract on test job with empty env block 86ddcb7
chore(tooling): exclude scripts/_dev/ from ruff and mypy d15fbd3
docs(plans): judge-layer v1 implementation plan - 12 phases, ~50 tasks 171022a
docs(plans): judge-layer v1 design - supersede continuous-scale judges with discrete-anchored 2-judge jury + κ-validated calibration 44c65d4
docs(readme): correct test count 444 → 443 0e96cb9
Merge remote-tracking branch 'origin/main' into hf-deploy 4161c3e
Merge pull request #17 from tyy0811/dashboard-v3 fcfd067 unverified
Jane Yeung committed on
Redesign landing page: paper+ink visual system, instrumented pipeline, OWASP badges 82b6725
Merge remote-tracking branch 'origin/main' into hf-deploy efffb61
Merge pull request #16 from tyy0811/feat/chip-row-owasp-coverage-subtitle a9409b2 unverified
Jane Yeung committed on
feat(landing): OWASP coverage subtitle + LLM05 tooltip on corpus chips 414c372
Merge remote-tracking branch 'origin/main' into hf-deploy 63d835d
Merge pull request #15 from tyy0811/docs/security-llm07-residual-risk ddda523 unverified
Jane Yeung committed on