# evaluation_tests.md — Test Plan for `docs/modules/evaluation.md`
**Target modules:** `training/eval_baseline.py`, `training/eval_final.py`, `training/probe_reward_hacking.py` (aliased `training/probe.py` in test imports), `training/plots.py`
**Spec doc:** `DRIFTCALL/docs/modules/evaluation.md` (final sealed, 2026-04-26)
**Cross-refs:** `DRIFTCALL/docs/modules/training.md` §2.1 (`eval()` contract), §4.2 (`EvalReport`); `DRIFTCALL/docs/modules/rewards.md` §3.1 (purity), §3.6 (exploit classes), §4.2 (`Rewards.breakdown`); `DRIFTCALL/docs/modules/datasets.md` §4.7 (val split), §5 (catalogue hashes); `DRIFTCALL/CLAUDE.md` §3.1 (nine-section test-plan doc).
**Framework:** `pytest` + `hypothesis` + `unittest.mock` + `pytest-mpl` (plot image-compare, tolerance-based).
**Owner:** Person B (Rewards & Tests), co-signed by Person C (Training) for the `training.eval` delegation path.
**CUDA policy:** **Model inference is mocked by default.** Every test that would touch a real LoRA adapter or base weight goes through the `stubbed_training_eval` fixture (§5.3), which monkeypatches `training.train.eval` to return a hand-crafted `EvalReport`. Zero CUDA calls in CI. A single `@pytest.mark.cuda` integration test exercises the real delegation path and is skipped when `torch.cuda.is_available() is False` (the default laptop / CI environment).
**Deterministic RNG:** `numpy.random.default_rng(20260426)` for baseline + final bootstrap, `numpy.random.default_rng(20260427)` for the probe CI, `numpy.random.default_rng(20260428)` for paired-difference bootstrap. Seeds are frozen in `evaluation.md` §2.4 and re-asserted at every call site.
**Numeric tolerance:** `math.isclose(a, b, abs_tol=1e-9, rel_tol=0.0)` for scalar floats; `numpy.testing.assert_allclose(..., atol=1e-6, rtol=0.0)` for sample arrays; byte-exact for serialized JSON (`sort_keys=True, separators=(",", ":")`); image diff tolerance `rms < 0.5` pixel for plot snapshots.
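
The scalar and byte-exact comparison rules above can be sketched with the standard library alone. The helper names `scalars_match` and `canonical_json_bytes` are hypothetical and exist only for illustration; the tolerance values and `json.dumps` arguments are the ones stated in this plan.

```python
import json
import math


def scalars_match(a: float, b: float) -> bool:
    """Scalar float comparison: absolute tolerance only, no relative slack."""
    return math.isclose(a, b, abs_tol=1e-9, rel_tol=0.0)


def canonical_json_bytes(obj) -> bytes:
    """Canonical serialization used for byte-exact comparisons."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Key order in the input does not affect the canonical bytes.
assert canonical_json_bytes({"b": 1, "a": [1, 2]}) == canonical_json_bytes({"a": [1, 2], "b": 1})
```

Using `rel_tol=0.0` keeps the check symmetric around zero, which matters for reward channels whose means can sit very close to 0.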
This plan specifies **100% line coverage** and **≥ 95% branch coverage** on `training/eval_baseline.py`, `training/eval_final.py`, `training/probe.py`, and `training/plots.py`. Every function signature in evaluation.md §2, every behavior clause in §3.1–§3.8, every error mode in §5, every data-structure invariant in §4, and every edge case in §7 has at least one dedicated test.
**Fixtures contract (§5).** This plan is the **source of truth** for all eval-related fixtures. Bidirectionally-consistent sharing map (see §5.6 for the authoritative truth-verified table):
- `eval_50_episodes_val_slice` — defined here (§5.1). **Imported** by `training_tests.md` §3.4 only (via cross-reference at `training_tests.md §5.6`).
- `baseline_eval_report_fixture`, `final_eval_report_fixture` — defined here (§5.2, §5.3). **Eval-only**; not consumed by any other plan.
- `probe_report_no_exploits` — defined here (§5.4). **Imported** by `pitch_demo_tests.md` §5.5 (blog badge I12) only.
- `probe_report_with_novel_class` — defined here (§5.5). Eval-only.
Any change to fixture bodies MUST be mirrored in `tests/conftest.py` in lockstep. PR reviewers for any consuming plan must verify content parity against this plan's §5.
---
## 1. Unit tests
**Organisation:** one `pytest` module per behavior cluster. Zero real inference — every test either uses `stubbed_training_eval` or asserts against pure-Python helpers.
File layout under `tests/test_evaluation/`:
```
tests/test_evaluation/
    __init__.py
    test_run_eval_signature.py              # run_eval signature + delegation (§2.1, §3.2)
    test_episode_selection_deterministic.py # val[0:50], seed hashing, leak guard (§3.1)
    test_sampling_policy_frozen.py          # T=0, num_gen=1, eval(), no_grad (§3.2)
    test_aggregation_bootstrap_ci.py        # per-reward CI store (§3.3)
    test_per_language_cohort.py             # cohort means + low-n rendering (§3.4)
    test_drift_detection_latency.py         # §3.5 latency compute + Stage-1 NaN
    test_probe_scanner_mechanics.py         # §3.6 scan + Counter aggregation
    test_probe_novel_class.py               # §3.6 unknown offense code path
    test_probe_on_base_guard.py             # ProbeOnBaseModelError (§5)
    test_probe_insufficient_samples.py      # ProbeInsufficientSamplesError (§5)
    test_episode_set_leak_error.py          # EpisodeSetLeakError on mismatch (§5)
    test_catalogue_hash_mismatch.py         # CatalogueHashMismatchError (§5)
    test_eval_budget_exceeded.py            # EvalBudgetExceededError all three ceilings (§3.8, §5)
    test_bootstrap_edge_cases.py            # len 0 / 1 / all-identical samples (§2.4)
    test_zero_success_baseline.py           # warning + CI undefined breakdown (§7.1)
    test_plot_rendering.py                  # all 4 curves, PNG exists + shape (§2.1, §3.5)
    test_plot_graceful_degrade.py           # WandB unavailable skips 2 plots (§3.5, §7.6)
    test_render_probe_report_md.py          # fixed 35-line template (§2.3, §4.5)
    test_probe_report_json_roundtrip.py     # sort_keys serialization (§4.4)
    test_exploit_classes_always_emitted.py  # all 5 in report even when zero (§3.6)
```
**Unit test case inventory — 27 cases total (exceeds the ≥ 25 requirement).**
### 1.1 `run_eval` signature + delegation — `test_run_eval_signature.py`
**Scope:** `run_eval(model_path, episodes=50) -> EvalReport` is a thin wrapper over `training.train.eval`; it must pass `model_path` through verbatim, default `episodes=50`, and return the delegate's `EvalReport` unchanged.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U1 | `test_run_eval_default_episodes_50` | Stub `training.train.eval` to capture kwargs; call `run_eval("base")`. | Delegate invoked with `model_path="base"` and `episodes=50`. |
| U2 | `test_run_eval_accepts_literal_base` | `run_eval("base", 50)`. | No error; returns stubbed `EvalReport` with `model_path == "base"`. |
| U3 | `test_run_eval_accepts_path_object` | `run_eval(Path("/tmp/ckpt"), 50)`. | Delegate invoked with `model_path=Path("/tmp/ckpt")`. |
| U4 | `test_run_eval_propagates_model_load_error` | Stub raises `EvalModelLoadError("adapter missing")`. | `run_eval` re-raises `EvalModelLoadError`; no silent fallback. |
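
As an illustration of how U1 and U3 can capture the delegation without touching real weights, here is a minimal sketch built on `unittest.mock`. The `train` namespace and the two-field `EvalReport` are hypothetical stand-ins for `training.train` and the real dataclass, introduced only so the sketch runs outside the repo.

```python
from dataclasses import dataclass
from pathlib import Path
from types import SimpleNamespace
from typing import Union
from unittest import mock


@dataclass(frozen=True)
class EvalReport:
    """Hypothetical stand-in; the real EvalReport carries many more fields."""
    model_path: str
    n_episodes: int


def _real_eval(model_path, episodes):
    raise RuntimeError("real inference is never reached in unit tests")


# Stand-in for the `training.train` module (assumption for this sketch).
train = SimpleNamespace(eval=_real_eval)


def run_eval(model_path: Union[str, Path], episodes: int = 50):
    """Thin wrapper: pass arguments through, return the delegate's report unchanged."""
    return train.eval(model_path=model_path, episodes=episodes)


def check_default_episodes() -> None:
    stub_report = EvalReport(model_path="base", n_episodes=50)
    with mock.patch.object(train, "eval", return_value=stub_report) as delegate:
        report = run_eval("base")
    delegate.assert_called_once_with(model_path="base", episodes=50)
    assert report is stub_report  # returned unchanged, not copied or rewrapped
```

Patching the attribute on the module object (rather than rebinding a local name) mirrors what a `monkeypatch`-based fixture would do for `training.train.eval`.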
### 1.2 Episode selection determinism — `test_episode_selection_deterministic.py`
**Scope:** evaluation.md §3.1 — baseline and final both read `val/briefs.jsonl[0:50]` in file order; probe reads `[50:250]`; env seed = `hash((episode_id, "eval")) & 0xFFFFFFFF`; paired `(episode_id, seed)` tuples must match between baseline and final.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U5 | `test_eval_reads_first_50_rows_in_file_order` | `eval_50_episodes_val_slice` fixture + mock `load_briefs` recording `.take(50)`. Call `run_eval("base", 50)`. | `load_briefs` received exactly the first 50 `BriefRow`s; episode_ids order matches file order. |
| U6 | `test_probe_reads_rows_50_to_250` | Mock `load_briefs` with 500-row fixture. Call `probe_reward_hacking(Path("/ckpt"), 200)`. | The 200 `BriefRow`s passed to `training.eval` are rows `[50:250]` — disjoint from the paired 50, confirmed by episode_id set intersection == ∅. |
| U7 | `test_env_seed_is_hash_tuple_episode_id_eval` | Mock `env.reset` to record seed. | For each episode, recorded `seed == hash((episode_id, "eval")) & 0xFFFFFFFF`. |
| U8 | `test_baseline_and_final_share_same_seeds` | `stubbed_training_eval` records seeds per run. Run baseline then final. | `baseline.breakdown["episode_ids"] == final.breakdown["episode_ids"]`; zipped seeds are pairwise identical. |
| U9 | `test_episode_set_leak_error_raised_on_mismatch` | Baseline fixture with episode_ids `[a,b,c,...]`; final fixture with `[a,b,X,...]`. Call post-run guard. | Raises `EpisodeSetLeakError` with substring `"paired-comparison invariant"`. |
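
The slice boundaries, seed formula, and post-run guard exercised by U5 to U9 can be sketched as follows. The function names are hypothetical; the seed expression and the error substring come from this plan. One caveat worth keeping in mind: Python salts `str` hashes per process unless `PYTHONHASHSEED` is pinned, so the seed below is only reproducible within one process or under a fixed `PYTHONHASHSEED`.

```python
from typing import Sequence, Tuple


class EpisodeSetLeakError(Exception):
    """Stand-in for the error class defined by the evaluation module."""


def env_seed(episode_id: str) -> int:
    """Seed formula from the scope above, masked to 32 bits."""
    return hash((episode_id, "eval")) & 0xFFFFFFFF


def split_val(rows: Sequence) -> Tuple[Sequence, Sequence]:
    """Paired eval reads rows [0:50]; the probe reads rows [50:250]."""
    return rows[0:50], rows[50:250]


def assert_paired(baseline_ids: Sequence[str], final_ids: Sequence[str]) -> None:
    """Post-run guard: both runs must cover the same episodes in the same order."""
    if tuple(baseline_ids) != tuple(final_ids):
        raise EpisodeSetLeakError(
            "paired-comparison invariant violated: baseline and final episode_ids differ"
        )
```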
### 1.3 Sampling policy frozen — `test_sampling_policy_frozen.py`
**Scope:** evaluation.md §3.2 — greedy `temperature=0.0`, `top_k=1`, `num_generations=1`, `model.eval()`, `torch.no_grad()`, all dropouts OFF. `run_eval` re-asserts these at entry.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U10 | `test_run_eval_enforces_temperature_zero` | Stub `training.eval` with a capture-then-assert that records sampling kwargs. | Captured `temperature == 0.0`; `top_k == 1`; `num_generations == 1`. |
| U11 | `test_run_eval_wraps_in_no_grad_and_eval_mode` | Mock torch context; assert `model.eval()` called and `torch.no_grad()` context entered before first forward. | `model.eval` call_count ≥ 1; `no_grad.__enter__` called before first forward. |
| U12 | `test_run_eval_dropouts_off` | Stub model with `.train()` recorded; assert `.eval()` wins and dropout modules report `.training is False`. | All `nn.Dropout` / LoRA-dropout modules have `training is False` at sample time. |
### 1.4 Aggregation + bootstrap CI — `test_aggregation_bootstrap_ci.py`
**Scope:** evaluation.md §3.3 — `bootstrap_ci(samples, n_boot=10_000, alpha=0.05, rng_seed=20260426)` called once per reward channel; results stored on `EvalReport.r{i}_mean_ci` tuple.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U13 | `test_bootstrap_ci_default_n_boot_10000` | Call `bootstrap_ci(tuple(range(50)), n_boot=10_000, rng_seed=20260426)`. | Returned `(mean, lo, hi)` triple; `lo < mean < hi`; mean within `abs_tol=1e-9` of arithmetic mean. |
| U14 | `test_bootstrap_ci_deterministic_with_seed` | Call twice with identical args. | Byte-identical `(mean, lo, hi)` on both calls (re-run determinism). |
| U15 | `test_paired_difference_ci_uses_seed_20260428` | `paired_difference_ci(baseline, final)` with mocked rng capture. | `numpy.random.default_rng` called with `20260428`; output tuple reproducible. |
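
For orientation, a percentile bootstrap matching the signature in the scope above might look like the sketch below. This is an assumption about the implementation, not the module's actual code: it resamples with replacement via `numpy.random.default_rng`, takes the `alpha/2` and `1 - alpha/2` quantiles of the resampled means, and short-circuits the degenerate inputs listed in evaluation.md §2.4.

```python
import numpy as np


def bootstrap_ci(samples, n_boot=10_000, alpha=0.05, rng_seed=20260426):
    """Return (mean, lo, hi) using a percentile bootstrap over the sample mean."""
    values = np.asarray(samples, dtype=float)
    n = values.size
    if n == 0:
        nan = float("nan")
        return (nan, nan, nan)
    if n == 1 or bool(np.all(values == values[0])):
        v = float(values[0])  # no variance: the interval collapses to the value
        return (v, v, v)
    rng = np.random.default_rng(rng_seed)
    # One row of resampled indices per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = values[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return (float(values.mean()), float(lo), float(hi))
```

Because the generator is constructed inside the function from `rng_seed`, two calls with identical arguments return identical triples, which is the property U14 asserts.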
### 1.5 Bootstrap edge cases — `test_bootstrap_edge_cases.py`
**Scope:** evaluation.md §2.4 — len 0 → `(nan, nan, nan)`; len 1 → `(v, v, v)`; all-identical → `(v, v, v)` no variance.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U16 | `test_bootstrap_ci_len0_returns_all_nan` | `bootstrap_ci(tuple(), n_boot=10_000)`. | All three outputs `math.isnan(...)`. |
| U17 | `test_bootstrap_ci_len1_returns_triple_v` | `bootstrap_ci((0.42,), n_boot=10_000)`. | Returns `(0.42, 0.42, 0.42)` exactly. |
### 1.6 Per-language cohort rendering — `test_per_language_cohort.py`
**Scope:** evaluation.md §3.4 — cohorts with `n_episodes >= 5` render numeric mean + 95% CI; `1 <= n <= 4` renders striped low-n bar with label `(low-n)`; `n == 0` renders empty slot labelled `(no episodes)`. No CI computed for low-n or empty cohorts.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U18 | `test_cohort_ge_5_renders_numeric_with_ci` | Fixture cohort of 11 episodes. | Rendered bar payload has `n_episodes == 11`, numeric `mean`, `ci is not None`. |
| U19 | `test_cohort_low_n_1_to_4_renders_striped` | Fixture cohort of 3 episodes. | Rendered bar carries `style == "striped"`, `label.endswith("(low-n)")`, `ci is None`. |
| U20 | `test_cohort_empty_renders_empty_slot` | Fixture cohort of 0 episodes. | Rendered entry has `n_episodes == 0`, `label.endswith("(no episodes)")`, `math.isnan(mean)`. |
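
The three rendering tiers can be sketched as a small pure function. The payload keys mirror the assertions in U18 to U20; the function name, the `"solid"` and `"empty"` style values, and the injected `ci_fn` callable are assumptions made for this illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

CI = Tuple[float, float, float]


def cohort_bar_payload(language: str, rewards: Sequence[float],
                       ci_fn: Callable[[Sequence[float]], CI]) -> Dict[str, object]:
    """Build the bar payload for one language cohort."""
    n = len(rewards)
    if n == 0:
        # Empty cohort: placeholder slot, no mean and no CI.
        return {"n_episodes": 0, "mean": float("nan"), "ci": None,
                "style": "empty", "label": f"{language} (no episodes)"}
    mean = sum(rewards) / n
    if n < 5:
        # Low-n cohort (1 to 4 episodes): striped bar, CI deliberately not computed.
        return {"n_episodes": n, "mean": mean, "ci": None,
                "style": "striped", "label": f"{language} (low-n)"}
    return {"n_episodes": n, "mean": mean, "ci": ci_fn(rewards),
            "style": "solid", "label": language}
```

Note that the empty-cohort mean has to be checked with `math.isnan`, since NaN never compares equal (or identical) to a freshly constructed `float("nan")`.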
### 1.7 Probe scanner + novel class — `test_probe_scanner_mechanics.py`, `test_probe_novel_class.py`
**Scope:** evaluation.md §3.6 — all 5 exploit classes always enumerated in the report (count ≥ 0); any `offense.code ∉ EXPLOIT_CLASSES` surfaces as a Novel exploit (threshold = 1 occurrence).
| # | Name | Setup | Assertion |
|---|---|---|---|
| U21 | `test_probe_emits_all_5_known_classes_even_when_zero` | Fixture rewards with zero anti-hack offenses. Call `probe_reward_hacking`. | `len(report.per_class) >= 5`; every member of `EXPLOIT_CLASSES` present; each has `count == 0`, `rate == 0.0`, `example_episode_id is None`. |
| U22 | `test_probe_novel_class_surfaced_on_single_instance` | Fixture: one episode with `offense.code == "zero_width_evasion"`. | `report.novel_classes == ("zero_width_evasion",)`; summary row for it has `count == 1`; markdown contains substring `"UNKNOWN EXPLOIT CLASS"`. |
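
A minimal sketch of the aggregation that U21 and U22 pin down, using `collections.Counter`. The names `ClassSummary` and `summarize_offenses` are hypothetical, and counting every offense occurrence (rather than once per episode) is an assumption; the known-class tuple is passed in so the sketch does not guess the contents of `EXPLOIT_CLASSES`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ClassSummary:
    exploit_class: str
    count: int
    rate: float
    example_episode_id: Optional[str]


def summarize_offenses(
    episodes: Sequence[Tuple[str, List[str]]],
    known_classes: Sequence[str],
) -> Tuple[Tuple[ClassSummary, ...], Tuple[str, ...]]:
    """episodes is a sequence of (episode_id, offense codes) pairs."""
    counts: Counter = Counter()
    first_seen: Dict[str, str] = {}
    for episode_id, codes in episodes:
        for code in codes:
            counts[code] += 1
            first_seen.setdefault(code, episode_id)
    # A single occurrence of an unknown code is enough to surface it.
    novel = tuple(sorted(code for code in counts if code not in known_classes))
    n = len(episodes)
    rows = tuple(
        ClassSummary(code, counts[code], counts[code] / n if n else 0.0, first_seen.get(code))
        for code in (*known_classes, *novel)
    )
    return rows, novel
```

Iterating over `known_classes` first guarantees that every known class gets a row even at `count == 0`, with novel classes appended after them.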
### 1.8 Probe guards — `test_probe_on_base_guard.py`, `test_probe_insufficient_samples.py`
**Scope:** evaluation.md §5 — `ProbeOnBaseModelError` when `model_path=="base"`; `ProbeInsufficientSamplesError` when `episodes < 50`.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U23 | `test_probe_on_base_raises` | `probe_reward_hacking("base", 200)`. | Raises `ProbeOnBaseModelError`; no call to `training.eval`. |
| U24 | `test_probe_insufficient_samples_raises` | `probe_reward_hacking(Path("/ckpt"), 49)`. | Raises `ProbeInsufficientSamplesError` with substring `"n < 50"`. |
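
Both guards are cheap precondition checks that should run before any rollout is requested. A sketch under stated assumptions: the helper name `check_probe_preconditions`, the order of the two checks, and the exact message wording beyond the `"n < 50"` substring are all hypothetical.

```python
from pathlib import Path
from typing import Union


class ProbeOnBaseModelError(Exception):
    """Stand-in for the error class defined by the evaluation module."""


class ProbeInsufficientSamplesError(Exception):
    """Stand-in for the error class defined by the evaluation module."""


MIN_PROBE_EPISODES = 50


def check_probe_preconditions(model_path: Union[str, Path], episodes: int) -> None:
    """Raise before any call to the eval delegate if the probe request is invalid."""
    if model_path == "base":
        raise ProbeOnBaseModelError("probe requires a trained checkpoint, not the base model")
    if episodes < MIN_PROBE_EPISODES:
        raise ProbeInsufficientSamplesError(f"n < 50: probe requested with {episodes} episodes")
```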
### 1.9 Catalogue hash + budget — `test_catalogue_hash_mismatch.py`, `test_eval_budget_exceeded.py`
**Scope:** evaluation.md §3.1 (catalogue pinning) and §3.8 (wall-clock ceilings 20 min / 60 min / 2 min, raising `EvalBudgetExceededError`).
| # | Name | Setup | Assertion |
|---|---|---|---|
| U25 | `test_catalogue_hash_mismatch_blocks_eval` | Fixture `BriefRow.catalogue_hash` set to `"stale"`; currently-loaded yaml hashes to `"current"`. Call `run_eval`. | Raises `CatalogueHashMismatchError` before any rollout; no `training.eval` call. |
| U26 | `test_run_eval_budget_20min_exceeded_raises` | Monkeypatch `time.monotonic` to simulate 20 min 1 s elapsed. | Raises `EvalBudgetExceededError` with substring `"run_eval"` and `"20 min"`. |
| U27 | `test_probe_budget_60min_and_plot_budget_2min` | Parametrized across `(probe_reward_hacking, 60*60+1)` and `(render_plots, 120+1)`. | Both raise `EvalBudgetExceededError` naming their respective ceiling. |
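
One way to make the three ceilings testable without sleeping is to read the clock through an injectable callable. The sketch below is an assumption about structure, not the module's code: `wall_clock_budget` and `BUDGET_SECONDS` are hypothetical names, and the check runs only once the wrapped work finishes, whereas a real implementation might also check between episodes.

```python
import time
from contextlib import contextmanager

class EvalBudgetExceededError(Exception):
    """Stand-in for the error class defined by the evaluation module."""

# Ceilings from the scope above: 20 min / 60 min / 2 min.
BUDGET_SECONDS = {
    "run_eval": 20 * 60,
    "probe_reward_hacking": 60 * 60,
    "render_plots": 2 * 60,
}

@contextmanager
def wall_clock_budget(name, clock=None):
    """Raise EvalBudgetExceededError if the wrapped block overran its ceiling."""
    now = clock if clock is not None else time.monotonic  # looked up at call time
    start = now()
    yield
    elapsed = now() - start
    ceiling = BUDGET_SECONDS[name]
    if elapsed > ceiling:
        raise EvalBudgetExceededError(
            f"{name} exceeded its {ceiling // 60} min budget ({elapsed:.0f} s elapsed)"
        )
```

Resolving `time.monotonic` inside the function (instead of binding it as a default argument) keeps the helper compatible with tests that monkeypatch `time.monotonic`.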
---
## 2. Property tests
Property tests use `hypothesis.strategies` to generate arbitrary-but-bounded inputs and assert the invariants that evaluation.md commits to. Each property test is pinned with the `@hypothesis.seed(20260426)` decorator so failures are reproducible.
**Property inventory — 6 properties total (exceeds the ≥ 5 requirement).**
### 2.1 Eval purity — same (model, 50-ep) produces byte-identical `EvalReport`
```python
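from pathlib import Path
from hypothesis import HealthCheck, given, settings, strategies as st

@given(checkpoint=st.sampled_from([Path("/ckpt/a"), Path("/ckpt/b")]))
@settings(max_examples=10, deadline=None,
          # The function-scoped stub fixture is shared across generated examples.
          suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_eval_is_pure(checkpoint, stubbed_training_eval):
    r1 = run_eval(checkpoint, episodes=50)
    r2 = run_eval(checkpoint, episodes=50)
    assert serialize(r1) == serialize(r2)  # canonical JSON, byte-for-byte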
```
**Invariant:** For the same checkpoint and the same 50-row slice, two back-to-back `run_eval` calls produce `EvalReport` records whose canonical JSON (`sort_keys=True, separators=(",", ":")`) byte-compares equal, including every `r{i}_mean_ci` tuple and every WandB-independent `curves` entry. This is the "Deterministic on re-run" invariant from evaluation.md §1, bound to the fixed `rng_seed=20260426` in `bootstrap_ci`.
### 2.2 Bootstrap CI convergence — jitter ≤ 0.001 at `n_boot=10_000`
```python
from hypothesis import given, settings, strategies as st

@given(samples=st.lists(st.floats(min_value=0.0, max_value=1.0, allow_nan=False),
                        min_size=50, max_size=50))
@settings(max_examples=20, deadline=None)
def test_bootstrap_ci_converges(samples):
    # Same samples, two different bootstrap seeds: only Monte-Carlo jitter may differ.
    m1, lo1, hi1 = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426)
    m2, lo2, hi2 = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426 + 1)
    assert abs(lo1 - lo2) <= 0.001
    assert abs(hi1 - hi2) <= 0.001
```
**Invariant:** At `n_boot=10_000` the 2.5th / 97.5th percentile estimates differ by at most 0.001 across distinct bootstrap seeds on the same underlying samples — the Monte-Carlo jitter ceiling referenced by evaluation.md §3.3. This property defines "convergent enough" and guards against anyone lowering `n_boot` silently.
### 2.3 Paired-diff CI equals final − baseline per episode
```python
import math
from hypothesis import given, settings, strategies as st

unit_floats_50 = st.lists(st.floats(min_value=0.0, max_value=1.0, allow_nan=False),
                          min_size=50, max_size=50)

@given(b=unit_floats_50, f=unit_floats_50)
@settings(max_examples=30, deadline=None)
def test_paired_diff_ci_is_paired(b, f):
    pd_mean, _, _ = paired_difference_ci(tuple(b), tuple(f),
                                         n_boot=10_000, rng_seed=20260428)
    # Mean of per-index differences, not a difference of unlinked means.
    expected = sum(fi - bi for bi, fi in zip(b, f)) / len(b)
    assert math.isclose(pd_mean, expected, abs_tol=1e-9)
```
**Invariant:** The paired-difference mean is the per-index `final[i] - baseline[i]` arithmetic mean — not the difference of independent means. Guards against anyone accidentally computing `mean(final) - mean(baseline)` via two unlinked samples (evaluation.md §2.4 `paired_difference_ci` docstring + §7 edge case 7).
### 2.4 Paired-diff CI requires equal lengths
```python
import pytest
from hypothesis import assume, given, strategies as st

@given(n_b=st.integers(min_value=1, max_value=100),
       n_f=st.integers(min_value=1, max_value=100))
def test_paired_diff_requires_equal_lengths(n_b, n_f):
    assume(n_b != n_f)  # equal lengths are the happy path, covered elsewhere
    with pytest.raises(EpisodeSetLeakError):
        paired_difference_ci(tuple([0.1] * n_b), tuple([0.2] * n_f))
```
**Invariant:** Unequal lengths raise `EpisodeSetLeakError` — pairing is strictly index-aligned (evaluation.md §2.4).
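
To make the pairing explicit, a sketch of what the two properties above constrain: differences are formed index by index first, and the bootstrap then resamples those differences. This is an illustrative assumption about the implementation (percentile bootstrap, NaN triple for empty input), not the module's actual code.

```python
import numpy as np


class EpisodeSetLeakError(Exception):
    """Stand-in for the error class defined by the evaluation module."""


def paired_difference_ci(baseline, final, n_boot=10_000, alpha=0.05, rng_seed=20260428):
    """Return (mean, lo, hi) for the per-index differences final[i] - baseline[i]."""
    if len(baseline) != len(final):
        raise EpisodeSetLeakError(
            f"paired-comparison invariant violated: {len(baseline)} baseline "
            f"vs {len(final)} final samples"
        )
    diffs = np.asarray(final, dtype=float) - np.asarray(baseline, dtype=float)
    n = diffs.size
    if n == 0:
        nan = float("nan")
        return (nan, nan, nan)
    rng = np.random.default_rng(rng_seed)
    # Resample whole (baseline, final) pairs by resampling their differences.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return (float(diffs.mean()), float(lo), float(hi))
```

Resampling the differences keeps each episode's baseline and final scores linked, which is why the interval is typically tighter than one built from two independent bootstraps.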
### 2.5 Bootstrap CI bracketing — `lo ≤ mean ≤ hi`
```python
from hypothesis import given, strategies as st

@given(samples=st.lists(st.floats(min_value=-1.0, max_value=1.0, allow_nan=False),
                        min_size=2, max_size=200))
def test_bootstrap_ci_brackets_mean(samples):
    m, lo, hi = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426)
    assert lo <= m <= hi
```
**Invariant:** The percentile bootstrap CI always brackets the point estimate. Guards against transposed percentile extraction (e.g., accidentally returning 97.5 as `lo`).
### 2.6 Probe class closure — every emitted class is either known or appears in `novel_classes`
```python
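from pathlib import Path
from hypothesis import HealthCheck, given, settings, strategies as st

@given(offense_codes=st.lists(st.text(min_size=1, max_size=32), min_size=0, max_size=40))
@settings(suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_probe_class_closure(offense_codes, stubbed_training_eval_with_offenses):
    # Assumption: the fixture is a factory that seeds the generated codes into the stub.
    stubbed_training_eval_with_offenses(offense_codes)
    report = probe_reward_hacking(Path("/ckpt"), episodes=200)
    for row in report.per_class:
        assert row.exploit_class in EXPLOIT_CLASSES or row.exploit_class in report.novel_classes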
```
**Invariant:** Every `ProbeExploitClassSummary.exploit_class` value in `report.per_class` is either one of the 5 known classes or is listed verbatim in `report.novel_classes`. No summary row exists outside this closure — protects the discovery channel.
---
## 3. Integration tests
Integration tests exercise cross-module wiring end-to-end. Real `training.train.eval` is **stubbed** (no CUDA), but the full evaluation pipeline — `run_eval → EvalReport → render_plots` and `probe_reward_hacking → scan → render_probe_report_md` — runs on actual dataclass instances.
File: `tests/test_evaluation/test_integration.py`.
### 3.1 Baseline eval on base model (stubbed)
```
test_integration_baseline_on_base_model
  Setup:  stubbed_training_eval returns baseline_eval_report_fixture.
          eval_50_episodes_val_slice loaded from fixtures/val_briefs_50.jsonl.
  Action: run_eval("base", 50)
  Assert: EvalReport.model_path == "base"
          EvalReport.n_episodes == 50
          len(r1_mean_ci) == 3 and lo <= mean <= hi for every r{i}_mean_ci
          breakdown["episode_ids"] == tuple of 50 episode_ids from val[0:50]
          JSON round-trip through canonical serializer is byte-stable.
```
### 3.2 Final eval on trained LoRA (stubbed)
```
test_integration_final_on_trained_lora
  Setup:  stubbed_training_eval returns final_eval_report_fixture.
          baseline.json already present on disk (from §3.1).
  Action: run_eval(Path("/fake/ckpt/stage3_final"), 50)
          then post-run guard: assert baseline.episode_ids == final.episode_ids
  Assert: EvalReport.model_path == "/fake/ckpt/stage3_final"
          EvalReport.reward_mean_ci[0] > baseline.reward_mean_ci[0]
          paired_difference_ci stored under breakdown["paired_ci"]
          No EpisodeSetLeakError raised.
```
### 3.3 Probe 200 episodes → markdown report
```
test_integration_probe_200_episodes_produces_markdown
  Setup:  stubbed_training_eval returns 200 Rewards records with:
            - 2 hallucinated_field offenses
            - 1 bare_drift_claim offense
            - 0 of the other 3 classes
  Action: report = probe_reward_hacking(Path("/fake/ckpt"), 200)
          md_path = render_probe_report_md(report, tmp_path / "probe_report.md")
  Assert: md_path.exists()
          md_text.count("### ") == 5   # all 5 known classes present
          "Novel exploit classes: none" in md_text
          "**Total offenses:** 3" in md_text
          json_round_trip(report) bytes-stable
          probe_report.json passes schema validation.
```
### 3.4 Plot rendering for all 4 target curves
```
test_integration_render_all_4_plots
  Setup:  baseline_eval_report_fixture, final_eval_report_fixture,
          stubbed WandB run-history returning R{1..5}_mean per step and
          eval/drift_latency_p50+p95 at steps {50,100,150,200,300,400,500}.
  Action: paths = render_plots(baseline, final, wandb_run_id="stub-run",
                               out_dir=tmp_path)
  Assert: set(paths.keys()) == {
              "per_reward_stack", "drift_latency_vs_step",
              "per_language_bars", "before_after_bars"
          }
          every path .exists() and .stat().st_size > 1024 bytes
          PIL.Image.open(path).size == (1600, 900)   # canonical figsize
          pytest-mpl snapshot compare passes with rms < 0.5 per plot.
```
### 3.5 WandB unavailable — graceful degrade (2 plots)
```
test_integration_render_plots_without_wandb
  Setup:  render_plots(..., wandb_run_id=None)
  Assert: set(paths.keys()) == {"per_language_bars", "before_after_bars"}
          WandBHistoryUnavailableWarning emitted (captured via pytest.warns)
          no PlotRenderError raised
          returned dict omits the two history-driven plots.
```
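
The degrade path reduces to a selection step: without a run id, the two history-driven plots are dropped and a warning is emitted instead of an error. A minimal sketch of that selection, where `plots_to_render` is a hypothetical helper and the warning's base class (`UserWarning`) is an assumption:

```python
import warnings
from typing import Optional, Tuple


class WandBHistoryUnavailableWarning(UserWarning):
    """Stand-in for the warning class defined by the plots module."""


# These two need per-step WandB run history; the other two only need the EvalReports.
HISTORY_PLOTS = ("per_reward_stack", "drift_latency_vs_step")
REPORT_PLOTS = ("per_language_bars", "before_after_bars")


def plots_to_render(wandb_run_id: Optional[str]) -> Tuple[str, ...]:
    """Return the plot names that can be rendered with the data at hand."""
    if wandb_run_id is None:
        warnings.warn(
            "WandB run history unavailable; skipping " + ", ".join(HISTORY_PLOTS),
            WandBHistoryUnavailableWarning,
            stacklevel=2,
        )
        return REPORT_PLOTS
    return HISTORY_PLOTS + REPORT_PLOTS
```

In a pytest test the same behaviour is asserted with `pytest.warns(WandBHistoryUnavailableWarning)` around the call.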
### 3.6 GPU delegation path (skipped on CPU-only CI)
```
@pytest.mark.cuda
test_integration_real_training_eval_delegation
  Setup:  real Gemma 3n E2B + toy LoRA adapter on a 2-episode smoke slice.
  Assert: run_eval returns a valid EvalReport; no exception.
```
Skipped automatically when `torch.cuda.is_available() is False`.
---
## 4. Coverage target
**Line coverage:** 100% on each of:
- `training/eval_baseline.py`
- `training/eval_final.py`
- `training/probe.py` (aliased for `training/probe_reward_hacking.py`)
- `training/plots.py`
**Branch coverage:** ≥ 95% on the same files. Exclusions (via `# pragma: no cover` with justification comment):
1. `if TYPE_CHECKING:` import blocks.
2. Real-CUDA-only branches inside `training.eval` delegation that the `stubbed_training_eval` fixture bypasses (marked with `# pragma: no cover-stubbed`).
3. `matplotlib` backend-selection dead code paths (`backend == "Agg"` already forced at module import).
**Verification command:**
```
pytest tests/test_evaluation/ \
--cov=training.eval_baseline \
--cov=training.eval_final \
--cov=training.probe \
--cov=training.plots \
--cov-branch \
--cov-fail-under=100 \
--cov-report=term-missing
```
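With `--cov-branch` enabled, `--cov-fail-under` (coverage.py's `fail_under`) is evaluated against the combined line-and-branch total, and coverage.py has no separate `fail_under_branch` key in `.coveragerc`. The ≥ 95% branch threshold therefore has to be enforced by a separate check over the branch totals in the coverage report.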
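
One possible way to check the branch threshold on its own is to read the totals from a `coverage json` report. The sketch below assumes the `totals` object carries `num_branches` and `covered_branches` when branch measurement is on, as recent coverage.py versions emit; the helper names and the report path handling are hypothetical.

```python
import json
from pathlib import Path


def branch_percent(totals: dict) -> float:
    """Branch coverage in percent from a coverage.py JSON `totals` object."""
    num_branches = totals.get("num_branches", 0)
    if num_branches == 0:
        return 100.0  # nothing to cover
    return 100.0 * totals["covered_branches"] / num_branches


def check_branch_floor(report_path: Path, floor: float = 95.0) -> float:
    """Exit non-zero if branch coverage in the JSON report is below the floor."""
    totals = json.loads(report_path.read_text(encoding="utf-8"))["totals"]
    pct = branch_percent(totals)
    if pct < floor:
        raise SystemExit(f"branch coverage {pct:.2f}% is below the {floor:.0f}% floor")
    return pct
```

Such a script would run after `pytest --cov ... --cov-branch --cov-report=json`, leaving the line-coverage gate to the pytest invocation itself.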
**Per-file test → line mapping (authoritative):**
| File | Covering test file(s) | Targeted lines |
|---|---|---|
| `training/eval_baseline.py` | `test_run_eval_signature.py`, `test_episode_selection_deterministic.py`, `test_sampling_policy_frozen.py`, `test_zero_success_baseline.py`, `test_eval_budget_exceeded.py` | CLI argparse, `run_eval("base", …)` call, baseline.json write, ZeroSuccessBaselineWarning path. |
| `training/eval_final.py` | `test_run_eval_signature.py`, `test_episode_set_leak_error.py`, `test_aggregation_bootstrap_ci.py`, `test_plot_rendering.py`, `test_plot_graceful_degrade.py` | CLI, `run_eval(ckpt, …)`, paired-diff CI store, `render_plots` call, EpisodeSetLeakError guard at exit. |
| `training/probe.py` | `test_probe_scanner_mechanics.py`, `test_probe_novel_class.py`, `test_probe_on_base_guard.py`, `test_probe_insufficient_samples.py`, `test_render_probe_report_md.py`, `test_probe_report_json_roundtrip.py`, `test_exploit_classes_always_emitted.py` | `probe_reward_hacking`, `scan_episode_for_exploits`, `render_probe_report_md`, JSON serializer, all 5 guard branches. |
| `training/plots.py` | `test_plot_rendering.py`, `test_plot_graceful_degrade.py` | All 4 plot functions, WandB history fetcher, graceful-degrade branching, `PlotRenderError` path. |
---
## 5. Fixtures
All fixtures live in `tests/conftest.py` (repo-root shared scope), so each of the 5 fixture names resolves to identical content in this plan and in every consuming plan (`training_tests.md` and `pitch_demo_tests.md`; see §5.6). Fixture IDs are the function names registered via `@pytest.fixture`.
### 5.1 `eval_50_episodes_val_slice`
**Shape:** `tuple[BriefRow, ...]` of length 50. Sourced from `tests/fixtures/val_briefs_50.jsonl` — the first 50 rows of a publication seed of `val/briefs.jsonl` (datasets.md §4.7), committed verbatim (≤ 120 KiB). Each row carries `episode_id`, `seed`, `catalogue_hash`, `templates_sha256`, `i18n_sha256`, `goal.language ∈ {hi, ta, kn, en, hinglish}`, and the embedded `GoalSpec`.
**Shared with:** `training_tests.md` only (imported under the same canonical name `eval_50_episodes_val_slice` for `EpisodeDatasetAdapter` iteration in integration test 3.4; see the `training_tests.md §5.6` footer cross-reference). Not consumed by `pitch_demo_tests.md` or `risk_book_tests.md` (see §5.6). **This plan is the sole definition site**; all consumers import from `tests/conftest.py`.
### 5.2 `baseline_eval_report_fixture`
**Shape:** A hand-built `EvalReport` mirroring the numbers in evaluation.md §8.1 verbatim:
- `model_path = "base"`, `n_episodes = 50`
- `reward_mean_ci = (0.118, 0.086, 0.152)`, `r1_mean_ci = (0.100, 0.040, 0.180)`, etc.
- 5 `PerLanguageReport`s (hi n=11, ta n=10, kn n=9, en n=10, hinglish n=10)
- `drift_detection_latency` with stage2/stage3 all NaN, `undetected_count=27`
- `breakdown["episode_ids"] = tuple_of_50_ids` deterministic
- `reward_hacking_offenses = {"hallucinated_field": 7, ...}` per §8.1
**Used by:** integration §3.1, §3.4, §3.5. **Eval-only** — not shared with `training_tests.md` (which constructs `EvalReport`s inline via stubbed `training.eval`).
### 5.3 `final_eval_report_fixture`
**Shape:** Hand-built `EvalReport` matching evaluation.md §8.2:
- `model_path = "/abs/path/checkpoints/stage3_final"`, `n_episodes = 50`
- `reward_mean_ci = (0.542, 0.480, 0.604)`, etc.
- `drift_detection_latency` with stage2_mean=1.2, stage3_mean=1.6, undetected_count=9
- `curves` dict with the 4 keys enumerated in §8.2
- `breakdown["paired_ci"]` populated with ΔR1, ΔR2, Δreward_mean, Δdrift_latency triples
- Shares `episode_ids` tuple with §5.2 (paired-comparison invariant)
**Used by:** integration §3.2, §3.4, §3.5. **Eval-only** — not shared with `training_tests.md`.
### 5.4 `probe_report_no_exploits`
**Shape:** `ProbeReport` with `n_episodes=200`, `per_class` containing all 5 known classes at `count=0`, `rate=0.0`, `example_episode_id=None`, `total_hits=0`, `novel_classes=()`. Generated from a stubbed `training.eval` that returns 200 `Rewards` records with empty `anti_hack.offenses` lists.
**Used by:** integration §3.3 (happy-path markdown rendering), `pitch_demo_tests.md` (probe-artefact badge I12). Not consumed by `risk_book_tests.md` (risk-book's domain is `Risk.triage` / risk register, not exploit reports).
### 5.5 `probe_report_with_novel_class`
**Shape:** `ProbeReport` with `n_episodes=200`, `per_class` containing all 5 known classes (zeros) **plus** a sixth `ProbeExploitClassSummary` for `exploit_class="zero_width_evasion"` with `count=1`, `rate=0.005`, `example_episode_id="s3_ep_00000131"`. `novel_classes=("zero_width_evasion",)`. Generated from a stubbed `training.eval` that seeds a single `offense.code="zero_width_evasion"` into one episode's `Rewards.breakdown.anti_hack.offenses`.
**Used by:** novel-class unit test (§1.7 U22). Eval-only — not consumed by other plans.
### 5.6 Cross-plan sharing contract
This plan (`evaluation_tests.md`) is the **sole definition site** for all 5 fixtures below. Consumers import from `tests/conftest.py`; they MUST NOT redefine. `✅` = consumed (imported from conftest and actually referenced in the consuming plan). `—` = not consumed by that plan.
| Fixture | Defined in | evaluation_tests.md | training_tests.md | pitch_demo_tests.md | risk_book_tests.md |
|---|---|---|---|---|---|
| `eval_50_episodes_val_slice` | evaluation_tests.md §5.1 | ✅ consumed (definer) | ✅ consumed (integration §3.4 paired eval; see training_tests.md §5.6 footer cross-reference) | — (no eval slice needed by pitch demo tests) | — (risk-register domain, no eval slice needed) |
| `baseline_eval_report_fixture` | evaluation_tests.md §5.2 | ✅ consumed (definer) | — (eval-only; training stubs `training.eval` directly) | — | — |
| `final_eval_report_fixture` | evaluation_tests.md §5.3 | ✅ consumed (definer) | — (eval-only) | — | — |
| `probe_report_no_exploits` | evaluation_tests.md §5.4 | ✅ consumed (definer) | — (probe entry-point tested via direct mocks in U45) | ✅ consumed (I12 blog badge) | — (risk register is about `Risk.triage`, not exploit reports) |
| `probe_report_with_novel_class` | evaluation_tests.md §5.5 | ✅ consumed (definer) | — | — | — |
**Truth-verification rule:** every `✅ consumed` above is bidirectionally consistent: the consuming plan's own §5 both (a) does not re-define the fixture and (b) explicitly cross-references this section as the definition site. The only cross-plan consumer rows are `training_tests.md` for `eval_50_episodes_val_slice` (see `training_tests.md §5.6`) and `pitch_demo_tests.md` for `probe_report_no_exploits` (see `pitch_demo_tests.md §5.5`). All other cells are `—`.
If a downstream plan needs to mutate any of these fixtures, it must define a derived fixture (e.g., `@pytest.fixture def probe_report_no_exploits_for_demo(probe_report_no_exploits): ...`) rather than editing the shared body. Enforced by a `tests/conftest_lock.py` sha256 check over the 5 fixture sources.
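
A lock check of this kind can be sketched as a digest comparison over fixture source text. How `tests/conftest_lock.py` actually extracts and normalizes the sources is not specified in this plan, so the trailing-whitespace normalization and both helper names below are assumptions.

```python
import hashlib
from typing import Dict, List


def source_digest(source: str) -> str:
    """sha256 over the source text, ignoring trailing whitespace per line."""
    normalized = "\n".join(line.rstrip() for line in source.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_fixtures(sources: Dict[str, str], locked: Dict[str, str]) -> List[str]:
    """Names of fixtures whose current source no longer matches the locked digest."""
    return sorted(
        name for name, src in sources.items() if locked.get(name) != source_digest(src)
    )
```

A CI step would fail whenever `changed_fixtures` returns a non-empty list, forcing fixture edits to be accompanied by a deliberate lock update.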
---
**End of evaluation_tests.md.** This plan is sealed pending ≥ 2 fresh critic `NOTHING_FURTHER` returns per `DRIFTCALL/CLAUDE.md` §3.4.