docs+build: judge-layer v1 coupled-artifact updates
Five coupled artifacts updated to reflect the supersession (Phase 9
of the judge-layer v1 plan):
1. Makefile - adds calibrate (full pipeline) and evaluate-judges
(re-score existing outputs) targets. Both invoke build-table
--strict so partial-coverage warnings are caught at build time
rather than landing in the writeup.
2. README.md - adds 'Targets that cost money' four-column table
(target / API key / approximate cost / output) so anyone running
make commands knows the cost upfront. Test count corrected
443 → 509 to match current state.
3. DECISIONS.md - appends supersession entry defended by file
paths (labels JSONL, per-row predictions, kappa_table.md, writeup)
so future readers can trace any claim to its data.
4. measurements/README.md - adds the calibration-labels JSONL row,
namechecking the DECISIONS.md entry it backs and the κ table file
path it feeds.
(docs/DESIGN.md is workspace-only / .gitignored, so the equivalent
section rewrite there is a local edit only and not part of this
commit.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- DECISIONS.md +44 -0
- Makefile +16 -1
- README.md +14 -1
- measurements/README.md +1 -0
@@ -2116,3 +2116,47 @@ the actual container filesystem would have caught it pre-deploy.
 Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
 build infrastructure) but is the right long-term mitigation for this
 class of bug.
+
+## LLM-judge layer supersession - discrete-anchored 2-judge jury replaces continuous-score single-call
+
+The continuous-score single-call judges in `agent_bench/evaluation/metrics.py`
+(`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted
+and replaced by the per-dimension Judge layer at
+`agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle.
+
+**Design doc:** `docs/plans/2026-05-04-judge-layer-v1-design.md`.
+
+**Why this is a supersession, not a refactor.** The new layer differs from
+the old on six axes: discrete-anchored scale (vs continuous 0–1),
+reasoning-before-score JSON ordering (vs score-first), per-dimension
+judges (vs combined faithfulness/correctness), full provenance per call
+(judge_id + rubric_version + system_output_hash + prompt_seed; old had
+none), composable variance wrappers (rubric_permute, jury - old was
+single-call), and an intentional abstain-vs-raise discipline (vs silent
+`None` from a bare `except Exception`).
+
+**Evidence backing the supersession claim** - the calibration κ table
+quantifies the new layer's agreement with hand-labels across 6 ablation
+rows (baseline + 3 variance ablations + permute + 2-judge jury). The
+files defending this entry's claim, by file path:
+
+- `measurements/2026-05-04-judge-calibration-labels.jsonl` - 30 items × 3
+  dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the
+  literature ceiling). Lands in Phase 10.
+- `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`,
+  `_baseline_no_anchors.json`, `_baseline_no_abstain.json`,
+  `_permute.json`, `_jury_kappa_weighted.json` - per-row predictions.
+  Land in Phase 11.
+- `docs/_generated/kappa_table.md` - generated κ ablation table
+  copy-pasted into the writeup. Lands in Phase 11.
+- `docs/judge-design.md` - interpretive writeup with the closing
+  "when NOT to use LLM-judge" position. Lands in Phase 12.
+
+**Config-knob preservation.** `evaluation.judge_provider` is unchanged
+across all 5 YAML configs; new `evaluation.judge_dimensions` field
+defaults to the three v1 dimensions. Zero user-facing config migration.
+
+**Out of scope (v1.1+).** Mistral self-hosted as the third jury member,
+Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
+prompt optimization, citation_faithfulness in the default
+judge_dimensions, AC2 sympy-derived parity tests.
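The DECISIONS.md entry above names three properties that are easier to picture in code: reasoning-before-score JSON ordering, per-call provenance, and the abstain-vs-raise discipline. The following is a minimal sketch of how they could fit together, not the code in `agent_bench/evaluation/judges/`. The names `Verdict`, `JudgeOutputError`, and `parse_verdict`, the 1 to 5 scale, the use of SHA-256 for `system_output_hash`, and the convention that `"score": null` means "abstain" are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple


class JudgeOutputError(ValueError):
    """Hypothetical error type: the judge reply is malformed, which is a fault, not an abstention."""


@dataclass(frozen=True)
class Verdict:
    # Provenance fields named in the DECISIONS.md entry.
    judge_id: str
    rubric_version: str
    system_output_hash: str
    prompt_seed: int
    reasoning: str
    score: Optional[int]  # None means the judge explicitly abstained (assumed convention)


def parse_verdict(
    raw: str,
    *,
    judge_id: str,
    rubric_version: str,
    system_output: str,
    prompt_seed: int,
    scale: Tuple[int, ...] = (1, 2, 3, 4, 5),  # assumed discrete anchors
) -> Verdict:
    """Parse one judge reply; abstentions return a Verdict, malformed replies raise."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise JudgeOutputError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise JudgeOutputError("judge reply must be a JSON object")
    # json.loads keeps key order, so reasoning-before-score can be checked directly.
    if list(payload)[:2] != ["reasoning", "score"]:
        raise JudgeOutputError("expected keys 'reasoning' then 'score', in that order")
    score = payload["score"]
    if score is not None and (type(score) is not int or score not in scale):
        raise JudgeOutputError(f"score {score!r} is not one of the anchors {scale}")
    return Verdict(
        judge_id=judge_id,
        rubric_version=rubric_version,
        system_output_hash=hashlib.sha256(system_output.encode("utf-8")).hexdigest(),
        prompt_seed=prompt_seed,
        reasoning=str(payload["reasoning"]),
        score=score,
    )
```

The point of the split is that a caller can count abstentions (`score is None`) separately from failures, instead of both collapsing into a silent `None` as they would under a bare `except Exception`.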
@@ -1,6 +1,6 @@
 PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11
 
-.PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
+.PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain calibrate evaluate-judges docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
 
 install:
 	$(PYTHON) -m pip install -e ".[dev]"
@@ -34,6 +34,21 @@ benchmark:
 evaluate-langchain:
 	$(PYTHON) scripts/run_langchain_eval.py --provider openai
 
+calibrate: ## Run full calibration pipeline (system outputs → all rows → strict κ table). Costs ~$2 in API calls.
+	$(PYTHON) scripts/run_calibration.py generate-outputs
+	@for cfg in configs/calibration/rows/*.yaml; do \
+		echo "==> running judges for $$cfg"; \
+		$(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
+	done
+	$(PYTHON) scripts/run_calibration.py build-table --strict
+
+evaluate-judges: ## Re-run all rows + build-table against existing system_outputs (no regeneration). Costs ~$1.
+	@for cfg in configs/calibration/rows/*.yaml; do \
+		echo "==> running judges for $$cfg"; \
+		$(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
+	done
+	$(PYTHON) scripts/run_calibration.py build-table --strict
+
 docker:
 	docker-compose -f docker/docker-compose.yaml up --build
 
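Both new targets end with `build-table --strict`, which the commit message describes as catching partial-coverage warnings at build time. The internals of `scripts/run_calibration.py` are not shown in this diff, so the following is only a sketch of one plausible reading of that flag: in strict mode a missing row is an error, otherwise it is a warning. `CoverageError`, `check_coverage`, and the row names are hypothetical.

```python
import warnings
from typing import Iterable, List


class CoverageError(RuntimeError):
    """Hypothetical error raised when strict mode finds rows without results."""


def check_coverage(
    expected_rows: Iterable[str],
    available_rows: Iterable[str],
    *,
    strict: bool,
) -> List[str]:
    """Return the expected rows that have no results; raise instead when strict."""
    available = set(available_rows)
    missing = [row for row in expected_rows if row not in available]
    if missing:
        message = "missing results for rows: " + ", ".join(missing)
        if strict:
            # Strict mode: fail the build rather than emit a partial table.
            raise CoverageError(message)
        warnings.warn(message)
    return missing
```

Under this reading, a non-zero exit from the strict check stops `make` before a table with holes can be copied into the writeup, which matches the `|| exit 1` fail-fast style of the loops above.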
@@ -302,12 +302,25 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
 ## Testing
 
 ```bash
-make test #
+make test # 509 deterministic tests, no API keys needed
 make lint # ruff + mypy
 ```
 
 All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
 
+### Targets that cost money
+
+These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
+
+| Target | Requires API key | Approximate cost | What it produces |
+|---|---|---|---|
+| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.05 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json` |
+| `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
+| `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
+| `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
+
+Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`).
+
 ## Design Decisions
 
 See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.
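Since the two calibration targets need both keys, a pre-flight check can fail before any paid call is made. This is a small sketch of such a check and is not part of the commit: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and only the two targets that the table lists as needing both keys are covered.

```python
import os
from typing import Dict, List, Mapping, Optional

# Targets the README table lists as requiring both providers.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "calibrate": ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"],
    "evaluate-judges": ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"],
}


def missing_keys(target: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the environment variables the target needs that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS.get(target, []) if not env.get(key)]


if __name__ == "__main__":
    absent = missing_keys("calibrate")
    if absent:
        raise SystemExit("set these before running make calibrate: " + ", ".join(absent))
```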
@@ -12,3 +12,4 @@ Naming: `YYYY-MM-DD-<topic>-<variant>.log`
 
 Current entries:
 - `2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log` - HF Spaces cold-start samples N=1..3. Backs the DECISIONS.md entry "Cold-start gate fired - assumption falsified, fix deferred to v1.1 at the right cause."
+- `2026-05-04-judge-calibration-labels.jsonl` - 30 items × 3 dimensions hand-labels (single rater) for the κ ablation table in `docs/_generated/kappa_table.md` and the writeup at `docs/judge-design.md`. Backs the DECISIONS.md entry "LLM-judge layer supersession - discrete-anchored 2-judge jury replaces continuous-score single-call". Lands in Phase 10 (manual labeling).
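The hand-labels exist to feed a κ table that compares judge scores against the single rater. The diff does not say which κ variant the project computes, so the sketch below assumes plain unweighted Cohen's κ over paired discrete labels; `cohen_kappa` is an illustrative name, not a function from the repository.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two equal-length sequences of discrete labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    hand = [1, 1, 2, 2]
    judge = [1, 1, 2, 1]
    print(cohen_kappa(hand, judge))  # 0.5: observed 0.75, chance 0.5
```

With 30 items per dimension, one such value per dimension and per ablation row would fill a table like the one the entry describes, assuming labels and predictions are aligned item by item.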