Nomearod Claude Opus 4.7 (1M context) committed on
Commit 508e5ef · 1 Parent(s): 281b43d

docs+build: judge-layer v1 coupled-artifact updates

Five coupled artifacts updated to reflect the supersession (Phase 9
of the judge-layer v1 plan):

1. Makefile - adds calibrate (full pipeline) and evaluate-judges
(re-score existing outputs) targets. Both invoke build-table
--strict so partial-coverage warnings are caught at build time
rather than landing in the writeup.

2. README.md - adds 'Targets that cost money' four-column table
(target / API key / approximate cost / output) so anyone running
make commands knows the cost upfront. Test count corrected
443 → 509 to match current state.

3. DECISIONS.md - appends supersession entry defended by file
paths (labels JSONL, per-row predictions, kappa_table.md, writeup)
so future readers can trace any claim to its data.

4. measurements/README.md - adds the calibration-labels JSONL row,
namechecking the DECISIONS.md entry it backs and the κ table file
path it feeds.

(docs/DESIGN.md is workspace-only / .gitignored, so the equivalent
section rewrite there is a local edit only and not part of this
commit.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (4)
  1. DECISIONS.md +44 -0
  2. Makefile +16 -1
  3. README.md +14 -1
  4. measurements/README.md +1 -0
DECISIONS.md CHANGED
@@ -2116,3 +2116,47 @@ the actual container filesystem would have caught it pre-deploy.
  Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
  build infrastructure) but is the right long-term mitigation for this
  class of bug.
+
+ ## LLM-judge layer supersession - discrete-anchored 2-judge jury replaces continuous-score single-call
+
+ The continuous-score single-call judges in `agent_bench/evaluation/metrics.py`
+ (`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted
+ and replaced by the per-dimension Judge layer at
+ `agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle.
+
+ **Design doc:** `docs/plans/2026-05-04-judge-layer-v1-design.md`.
+
+ **Why this is a supersession, not a refactor.** The new layer differs from
+ the old on six axes: discrete-anchored scale (vs continuous 0–1),
+ reasoning-before-score JSON ordering (vs score-first), per-dimension
+ judges (vs combined faithfulness/correctness), full provenance per call
+ (judge_id + rubric_version + system_output_hash + prompt_seed; old had
+ none), composable variance wrappers (rubric_permute, jury - old was
+ single-call), and an intentional abstain-vs-raise discipline (vs silent
+ `None` from a bare `except Exception`).
+
+ **Evidence backing the supersession claim** - the calibration κ table
+ quantifies the new layer's agreement with hand-labels across 6 ablation
+ rows (baseline + 3 variance ablations + permute + 2-judge jury). The
+ files defending this entry's claim, by file path:
+
+ - `measurements/2026-05-04-judge-calibration-labels.jsonl` - 30 items × 3
+ dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the
+ literature ceiling). Lands in Phase 10.
+ - `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`,
+ `_baseline_no_anchors.json`, `_baseline_no_abstain.json`,
+ `_permute.json`, `_jury_kappa_weighted.json` - per-row predictions.
+ Land in Phase 11.
+ - `docs/_generated/kappa_table.md` - generated κ ablation table
+ copy-pasted into the writeup. Lands in Phase 11.
+ - `docs/judge-design.md` - interpretive writeup with the closing
+ "when NOT to use LLM-judge" position. Lands in Phase 12.
+
+ **Config-knob preservation.** `evaluation.judge_provider` is unchanged
+ across all 5 YAML configs; new `evaluation.judge_dimensions` field
+ defaults to the three v1 dimensions. Zero user-facing config migration.
+
+ **Out of scope (v1.1+).** Mistral self-hosted as the third jury member,
+ Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
+ prompt optimization, citation_faithfulness in the default
+ judge_dimensions, AC2 sympy-derived parity tests.
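
Reviewer note: the DECISIONS.md entry names per-call provenance (judge_id, rubric_version, system_output_hash, prompt_seed), reasoning-before-score JSON ordering, and an abstain-vs-raise discipline, but this diff does not include the judge code itself. The sketch below shows one way those three ideas could fit together. Only the four provenance field names come from the entry; the `Verdict` dataclass, the `parse_verdict` helper, the 1-5 scale, and the use of SHA-256 are hypothetical and are not taken from `agent_bench/evaluation/judges/`.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional

# Assumed discrete anchors; the real scale is not shown in this diff.
VALID_SCORES = frozenset({1, 2, 3, 4, 5})


@dataclass(frozen=True)
class Verdict:
    # Provenance fields named in the DECISIONS.md entry.
    judge_id: str
    rubric_version: str
    system_output_hash: str
    prompt_seed: int
    # Judge output. score is None only when the judge abstained.
    reasoning: str
    score: Optional[int]
    abstained: bool


def parse_verdict(raw: str, *, judge_id: str, rubric_version: str,
                  system_output: str, prompt_seed: int) -> Verdict:
    """Parse a judge reply, abstaining on bad replies and raising on caller bugs."""
    # Raise: a non-string reply is a bug in the calling code, not a judge failure.
    if not isinstance(raw, str):
        raise TypeError(f"raw judge reply must be str, got {type(raw).__name__}")

    provenance = {
        "judge_id": judge_id,
        "rubric_version": rubric_version,
        "system_output_hash": hashlib.sha256(system_output.encode("utf-8")).hexdigest(),
        "prompt_seed": prompt_seed,
    }

    def abstain(reason: str) -> Verdict:
        # Abstain: an unusable reply is recorded with its reason, never a silent None.
        return Verdict(**provenance, reasoning=reason, score=None, abstained=True)

    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return abstain("reply was not valid JSON")
    if not isinstance(payload, dict) or "reasoning" not in payload or "score" not in payload:
        return abstain("reply is missing 'reasoning' or 'score'")

    keys = list(payload)  # json.loads preserves the key order of the reply
    if keys.index("reasoning") > keys.index("score"):
        return abstain("score appeared before reasoning")

    score = payload["score"]
    if isinstance(score, bool) or not isinstance(score, int) or score not in VALID_SCORES:
        return abstain(f"score {score!r} is not on the discrete scale")

    return Verdict(**provenance, reasoning=str(payload["reasoning"]), score=score, abstained=False)
```

In this sketch an abstention still carries full provenance, so a κ table builder could count abstentions per judge instead of dropping them silently.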
Makefile CHANGED
@@ -1,6 +1,6 @@
  PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11

- .PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
+ .PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain calibrate evaluate-judges docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate

  install:
  	$(PYTHON) -m pip install -e ".[dev]"
@@ -34,6 +34,21 @@ benchmark:
  evaluate-langchain:
  	$(PYTHON) scripts/run_langchain_eval.py --provider openai

+ calibrate: ## Run full calibration pipeline (system outputs → all rows → strict κ table). Costs ~$2 in API calls.
+ 	$(PYTHON) scripts/run_calibration.py generate-outputs
+ 	@for cfg in configs/calibration/rows/*.yaml; do \
+ 		echo "==> running judges for $$cfg"; \
+ 		$(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
+ 	done
+ 	$(PYTHON) scripts/run_calibration.py build-table --strict
+
+ evaluate-judges: ## Re-run all rows + build-table against existing system_outputs (no regeneration). Costs ~$1.
+ 	@for cfg in configs/calibration/rows/*.yaml; do \
+ 		echo "==> running judges for $$cfg"; \
+ 		$(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
+ 	done
+ 	$(PYTHON) scripts/run_calibration.py build-table --strict
+
  docker:
  	docker-compose -f docker/docker-compose.yaml up --build
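
Reviewer note: the two new targets share one shape (optionally regenerate outputs, run judges once per row config, then build the table with `--strict`) and differ only in the first step. The Python sketch below spells out that command sequence. The subcommands and flags are copied from the recipes above; the `plan_commands` and `run_pipeline` helpers are hypothetical names, and sorting the configs is an assumption that mirrors the shell's alphabetical glob order.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = "scripts/run_calibration.py"


def plan_commands(row_configs, regenerate, python=sys.executable):
    """Return the ordered list of commands for one pipeline run.

    regenerate=True corresponds to `make calibrate`,
    regenerate=False to `make evaluate-judges`.
    """
    commands = []
    if regenerate:
        commands.append([python, SCRIPT, "generate-outputs"])
    for cfg in sorted(str(c) for c in row_configs):
        commands.append([python, SCRIPT, "run-judges", f"--row-config={cfg}"])
    # The strict table build always runs last in both targets.
    commands.append([python, SCRIPT, "build-table", "--strict"])
    return commands


def run_pipeline(regenerate, rows_dir="configs/calibration/rows"):
    """Execute the plan, stopping at the first failing command."""
    row_configs = Path(rows_dir).glob("*.yaml")
    for cmd in plan_commands(row_configs, regenerate):
        print("==>", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit,
        # which plays the role of `|| exit 1` in the Make recipe.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(regenerate=False)` would issue the same commands as `make evaluate-judges`, and it spends API budget in the same way.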
 
README.md CHANGED
@@ -302,12 +302,25 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
  ## Testing

  ```bash
- make test # 443 deterministic tests, no API keys needed
+ make test # 509 deterministic tests, no API keys needed
  make lint # ruff + mypy
  ```

  All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.

+ ### Targets that cost money
+
+ These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
+
+ | Target | Requires API key | Approximate cost | What it produces |
+ |---|---|---|---|
+ | `make evaluate-full` | OpenAI or Anthropic | $0.01–0.05 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json` |
+ | `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
+ | `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
+ | `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
+
+ Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`).
+
  ## Design Decisions

  See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.
measurements/README.md CHANGED
@@ -12,3 +12,4 @@ Naming: `YYYY-MM-DD-<topic>-<variant>.log`

  Current entries:
  - `2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log` - HF Spaces cold-start samples N=1..3. Backs the DECISIONS.md entry "Cold-start gate fired - assumption falsified, fix deferred to v1.1 at the right cause."
+ - `2026-05-04-judge-calibration-labels.jsonl` - 30 items × 3 dimensions hand-labels (single rater) for the κ ablation table in `docs/_generated/kappa_table.md` and the writeup at `docs/judge-design.md`. Backs the DECISIONS.md entry "LLM-judge layer supersession - discrete-anchored 2-judge jury replaces continuous-score single-call". Lands in Phase 10 (manual labeling).
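
Reviewer note: the labels file above feeds a κ table, so a short reminder of what κ measures may help. The sketch below computes unweighted Cohen's κ between one list of hand labels and one list of judge predictions, using κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. This is an illustration under assumptions: the project may use a weighted variant (one results file is named `_jury_kappa_weighted.json`), and the `cohens_kappa` function and its inputs are hypothetical, not the project's code.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of each
    # rater's marginal frequency for that category.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1 - p_e)
```

With hand labels `[1, 1, 0, 0]` and judge predictions `[1, 1, 0, 1]`, observed agreement is 0.75 and chance agreement is 0.5, so κ is 0.5.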