Nomearod Claude Opus 4.7 (1M context) committed 26 days ago

Commit 508e5ef (1 parent: 281b43d)

docs+build: judge-layer v1 coupled-artifact updates

Five coupled artifacts updated to reflect the supersession (Phase 9
of the judge-layer v1 plan):

1. Makefile — adds calibrate (full pipeline) and evaluate-judges
(re-score existing outputs) targets. Both invoke build-table
--strict so partial-coverage warnings are caught at build time
rather than landing in the writeup.

2. README.md — adds 'Targets that cost money' four-column table
(target / API key / approximate cost / output) so anyone running
make commands knows the cost upfront. Test count corrected
443 → 509 to match current state.

3. DECISIONS.md — appends supersession entry defended by file
paths (labels JSONL, per-row predictions, kappa_table.md, writeup)
so future readers can trace any claim to its data.

4. measurements/README.md — adds the calibration-labels JSONL row,
namechecking the DECISIONS.md entry it backs and the κ table file
path it feeds.

(docs/DESIGN.md is workspace-only / .gitignored, so the equivalent
section rewrite there is a local edit only and not part of this
commit.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (4)

DECISIONS.md +44 -0
Makefile +16 -1
README.md +14 -1
measurements/README.md +1 -0

DECISIONS.md

@@ -2116,3 +2116,47 @@ the actual container filesystem would have caught it pre-deploy.
 Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
 build infrastructure) but is the right long-term mitigation for this
 class of bug.
+## LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call
+The continuous-score single-call judges in `agent_bench/evaluation/metrics.py`
+(`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted
+and replaced by the per-dimension Judge layer at
+`agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle.
+**Design doc:** `docs/plans/2026-05-04-judge-layer-v1-design.md`.
+**Why this is a supersession, not a refactor.** The new layer differs from
+the old on six axes: discrete-anchored scale (vs continuous 0–1),
+reasoning-before-score JSON ordering (vs score-first), per-dimension
+judges (vs combined faithfulness/correctness), full provenance per call
+(judge_id + rubric_version + system_output_hash + prompt_seed; old had
+none), composable variance wrappers (rubric_permute, jury — old was
+single-call), and an intentional abstain-vs-raise discipline (vs silent
+`None` from a bare `except Exception`).
+**Evidence backing the supersession claim** — the calibration κ table
+quantifies the new layer's agreement with hand-labels across 6 ablation
+rows (baseline + 3 variance ablations + permute + 2-judge jury). The
+files defending this entry's claim, by file path:
+- `measurements/2026-05-04-judge-calibration-labels.jsonl` — 30 items × 3
+  dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the
+  literature ceiling). Lands in Phase 10.
+- `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`,
+  `_baseline_no_anchors.json`, `_baseline_no_abstain.json`,
+  `_permute.json`, `_jury_kappa_weighted.json` — per-row predictions.
+  Land in Phase 11.
+- `docs/_generated/kappa_table.md` — generated κ ablation table copy-
+  pasted into the writeup. Lands in Phase 11.
+- `docs/judge-design.md` — interpretive writeup with the closing
+  "when NOT to use LLM-judge" position. Lands in Phase 12.
+**Config-knob preservation.** `evaluation.judge_provider` is unchanged
+across all 5 YAML configs; new `evaluation.judge_dimensions` field
+defaults to the three v1 dimensions. Zero user-facing config migration.
+**Out of scope (v1.1+).** Mistral self-hosted as the third jury member,
+Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
+prompt optimization, citation_faithfulness in the default
+judge_dimensions, AC2 sympy-derived parity tests.
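
As a reading aid for three of the six axes in this entry (reasoning-before-score ordering, per-call provenance, and abstain-vs-raise), here is a minimal sketch of how a judge reply could be parsed. Everything in it is hypothetical: `JudgeVerdict`, `parse_verdict`, `MalformedJudgeOutput`, the 1 to 5 scale, and the JSON field names are assumptions for illustration, not the code under `agent_bench/evaluation/judges/`. Only the four provenance field names come from the entry itself.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional

# Assumed anchor points; the entry only says the scale is discrete and anchored.
ALLOWED_SCORES = {1, 2, 3, 4, 5}


class MalformedJudgeOutput(ValueError):
    """Raised when a judge reply cannot be trusted. Never silently swallowed."""


@dataclass(frozen=True)
class JudgeVerdict:
    dimension: str
    reasoning: str
    score: Optional[int]  # None means the judge explicitly abstained
    # Provenance fields named in the DECISIONS.md entry:
    judge_id: str
    rubric_version: str
    system_output_hash: str
    prompt_seed: int


def parse_verdict(raw: str, *, dimension: str, judge_id: str,
                  rubric_version: str, system_output: str,
                  prompt_seed: int) -> JudgeVerdict:
    """Parse one judge reply, abstaining only when the judge says so."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedJudgeOutput(f"reply is not JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise MalformedJudgeOutput("reply must be a JSON object")
    keys = list(payload)  # json.loads preserves key order
    if "reasoning" not in payload or "score" not in payload:
        raise MalformedJudgeOutput("reply needs 'reasoning' and 'score'")
    if keys.index("reasoning") > keys.index("score"):
        raise MalformedJudgeOutput("'reasoning' must come before 'score'")

    score = payload["score"]
    if score == "abstain":
        score = None  # an intentional abstention, recorded as such
    elif isinstance(score, bool) or not isinstance(score, int) \
            or score not in ALLOWED_SCORES:
        raise MalformedJudgeOutput(f"score {score!r} is off the scale")

    digest = hashlib.sha256(system_output.encode("utf-8")).hexdigest()
    return JudgeVerdict(dimension=dimension,
                        reasoning=str(payload["reasoning"]),
                        score=score, judge_id=judge_id,
                        rubric_version=rubric_version,
                        system_output_hash=digest,
                        prompt_seed=prompt_seed)
```

The point of the sketch is the contrast with a bare `except Exception` that returns `None`: here `None` can only mean the judge chose to abstain, while every malformed reply surfaces as an exception the caller has to handle.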

Makefile

@@ -1,6 +1,6 @@
 PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11
-.PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
+.PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain calibrate evaluate-judges docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
 install:
 	$(PYTHON) -m pip install -e ".[dev]"
@@ -34,6 +34,21 @@ benchmark:
 evaluate-langchain:
 	$(PYTHON) scripts/run_langchain_eval.py --provider openai
+calibrate:  ## Run full calibration pipeline (system outputs → all rows → strict κ table). Costs ~$2 in API calls.
+	$(PYTHON) scripts/run_calibration.py generate-outputs
+	@for cfg in configs/calibration/rows/*.yaml; do \
+		echo "==> running judges for $$cfg"; \
+		$(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
+	done
+	$(PYTHON) scripts/run_calibration.py build-table --strict
+evaluate-judges:  ## Re-run all rows + build-table against existing system_outputs (no regeneration). Costs ~$1.
+	@for cfg in configs/calibration/rows/*.yaml; do \
+		echo "==> running judges for $$cfg"; \
+		$(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
+	done
+	$(PYTHON) scripts/run_calibration.py build-table --strict
 docker:
 	docker-compose -f docker/docker-compose.yaml up --build
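
The control flow of the two new targets can be restated outside Make. The sketch below is illustrative only: `plan_commands` and `run_fail_fast` are hypothetical helpers, and the only names taken from the Makefile are the script path, the three subcommands, and the `--row-config` and `--strict` flags. Sorting the row configs approximates shell glob ordering.

```python
from typing import Callable, Iterable, List, Sequence

SCRIPT = "scripts/run_calibration.py"  # path as used in the Makefile


def plan_commands(row_configs: Iterable[str], *, regenerate: bool,
                  python: str = "python3") -> List[List[str]]:
    """Build the command sequence for calibrate (regenerate=True)
    or evaluate-judges (regenerate=False)."""
    commands: List[List[str]] = []
    if regenerate:
        commands.append([python, SCRIPT, "generate-outputs"])
    for cfg in sorted(row_configs):
        commands.append([python, SCRIPT, "run-judges", f"--row-config={cfg}"])
    # Both targets finish with the strict table build.
    commands.append([python, SCRIPT, "build-table", "--strict"])
    return commands


def run_fail_fast(commands: Iterable[Sequence[str]],
                  runner: Callable[[Sequence[str]], int]) -> int:
    """Run commands in order and stop at the first non-zero exit code."""
    for cmd in commands:
        code = runner(cmd)
        if code != 0:
            return code
    return 0
```

A real runner could be `lambda cmd: subprocess.run(cmd).returncode`. The explicit early return plays the role of `|| exit 1` in the recipe: the `for` loop is a single shell command, so without it the loop's exit status would reflect only the last row and an earlier failure would go unnoticed before `build-table` runs.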

README.md

@@ -302,12 +302,25 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
 ## Testing
 ```bash
-make test    # 443 deterministic tests, no API keys needed
+make test    # 509 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```
 All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
+### Targets that cost money
+These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
+| Target | Requires API key | Approximate cost | What it produces |
+|---|---|---|---|
+| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.05 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json` |
+| `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
+| `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
+| `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
+Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`).
 ## Design Decisions
 See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.
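
The key requirements in the "Targets that cost money" table could be checked before spending anything. The following is a sketch, not part of the repository: `REQUIREMENTS` and `keys_satisfied` are hypothetical, and the "any" versus "all" split simply mirrors the table's "OpenAI or Anthropic" versus "Anthropic + OpenAI" wording.

```python
import os
from typing import Mapping, Optional

KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")

# Target name -> how many of the two keys the README table asks for.
REQUIREMENTS = {
    "evaluate-full": "any",
    "calibrate": "all",
    "evaluate-judges": "all",
    "evaluate-langchain": "any",
}


def keys_satisfied(target: str,
                   env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the environment holds the keys the target needs."""
    env = os.environ if env is None else env
    present = [bool(env.get(name)) for name in KEYS]
    if REQUIREMENTS[target] == "all":
        return all(present)
    return any(present)
```

Passing `env` explicitly keeps the check testable without touching the real environment; an unknown target raises `KeyError` rather than being treated as free to run.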

measurements/README.md

@@ -12,3 +12,4 @@ Naming: `YYYY-MM-DD-<topic>-<variant>.log`
 Current entries:
 - `2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log` — HF Spaces cold-start samples N=1..3. Backs the DECISIONS.md entry "Cold-start gate fired — assumption falsified, fix deferred to v1.1 at the right cause."
+- `2026-05-04-judge-calibration-labels.jsonl` — 30 items × 3 dimensions hand-labels (single rater) for the κ ablation table in `docs/_generated/kappa_table.md` and the writeup at `docs/judge-design.md`. Backs the DECISIONS.md entry "LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call". Lands in Phase 10 (manual labeling).
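
For readers unfamiliar with the κ statistic that these hand-labels feed, here is a minimal sketch of unweighted Cohen's κ between one human rater and one judge. It is an illustration under assumptions: the commit does not say which κ variant `build-table` computes, and `cohens_kappa` is a hypothetical helper, not a function from the repository.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable],
                 judge: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[c] * judge_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)
```

With 30 items per dimension, each cell of such a table rests on a small sample, so single-rater labels and per-dimension κ values are best read with that sample size in mind.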