# evaluation_tests.md — Test Plan for `docs/modules/evaluation.md`

**Target modules:** `training/eval_baseline.py`, `training/eval_final.py`, `training/probe_reward_hacking.py` (aliased `training/probe.py` in test imports), `training/plots.py`
**Spec doc:** `DRIFTCALL/docs/modules/evaluation.md` (final sealed, 2026-04-26)
**Cross-refs:** `DRIFTCALL/docs/modules/training.md` §2.1 (`eval()` contract, §4.2 `EvalReport`), `DRIFTCALL/docs/modules/rewards.md` §3.1 (purity), §3.6 (exploit classes), §4.2 (`Rewards.breakdown`), `DRIFTCALL/docs/modules/datasets.md` §4.7 (val split), §5 (catalogue hashes), `DRIFTCALL/CLAUDE.md` §3.1 (nine-section test-plan doc).
**Framework:** `pytest` + `hypothesis` + `unittest.mock` + `pytest-mpl` (plot image-compare, tolerance-based).
**Owner:** Person B (Rewards & Tests), co-signed by Person C (Training) for the `training.eval` delegation path.
**CUDA policy:** **Model inference is mocked by default.** Every test that would touch a real LoRA adapter or base weight goes through the `stubbed_training_eval` fixture (§5.3), which monkeypatches `training.train.eval` to return a hand-crafted `EvalReport`. Zero CUDA calls in CI. A single `@pytest.mark.cuda` integration test exercises the real delegation path and is skipped when `torch.cuda.is_available() is False` (the default laptop / CI environment).
**Deterministic RNG:** `numpy.random.default_rng(20260426)` for baseline + final bootstrap, `numpy.random.default_rng(20260427)` for the probe CI, `numpy.random.default_rng(20260428)` for paired-difference bootstrap. Seeds are frozen in `evaluation.md` §2.4 and re-asserted at every call site.
**Numeric tolerance:** `math.isclose(a, b, abs_tol=1e-9, rel_tol=0.0)` for scalar floats; `numpy.testing.assert_allclose(..., atol=1e-6, rtol=0.0)` for sample arrays; byte-exact for serialized JSON (`sort_keys=True, separators=(",", ":")`); image diff tolerance `rms < 0.5` pixel for plot snapshots.

This plan specifies **100% line coverage** and **≥ 95% branch coverage** on `training/eval_baseline.py`, `training/eval_final.py`, `training/probe.py`, and `training/plots.py`. Every function signature in evaluation.md §2, every behavior clause in §3.1–§3.8, every error mode in §5, every data-structure invariant in §4, and every edge case in §7 has at least one dedicated test.

**Fixtures contract (§5).** This plan is the **source of truth** for all eval-related fixtures. Bidirectionally-consistent sharing map (see §5.6 for the authoritative truth-verified table):

- `eval_50_episodes_val_slice` — defined here (§5.1). **Imported** by `training_tests.md` §3.4 only (via cross-reference at `training_tests.md §5.6`).
- `baseline_eval_report_fixture`, `final_eval_report_fixture` — defined here (§5.2, §5.3). **Eval-only**; not consumed by any other plan.
- `probe_report_no_exploits` — defined here (§5.4). **Imported** by `pitch_demo_tests.md` §5.5 (blog badge I12) only.
- `probe_report_with_novel_class` — defined here (§5.5). Eval-only.

Any change to fixture bodies MUST be mirrored in `tests/conftest.py` in lockstep. PR reviewers for any consuming plan must verify content parity against this plan's §5.

---

## 1. Unit tests

**Organisation:** one `pytest` module per behavior cluster. Zero real inference — every test either uses `stubbed_training_eval` or asserts against pure-Python helpers.

File layout under `tests/test_evaluation/`:

```
tests/test_evaluation/
  __init__.py
  test_run_eval_signature.py              # run_eval signature + delegation (§2.1, §3.2)
  test_episode_selection_deterministic.py # val[0:50], seed hashing, leak guard (§3.1)
  test_sampling_policy_frozen.py          # T=0, num_gen=1, eval(), no_grad (§3.2)
  test_aggregation_bootstrap_ci.py        # per-reward CI store (§3.3)
  test_per_language_cohort.py             # cohort means + low-n rendering (§3.4)
  test_drift_detection_latency.py         # §3.5 latency compute + Stage-1 NaN
  test_probe_scanner_mechanics.py         # §3.6 scan + Counter aggregation
  test_probe_novel_class.py               # §3.6 unknown offense code path
  test_probe_on_base_guard.py             # ProbeOnBaseModelError (§5)
  test_probe_insufficient_samples.py      # ProbeInsufficientSamplesError (§5)
  test_episode_set_leak_error.py          # EpisodeSetLeakError on mismatch (§5)
  test_catalogue_hash_mismatch.py         # CatalogueHashMismatchError (§5)
  test_eval_budget_exceeded.py            # EvalBudgetExceededError all three ceilings (§3.8, §5)
  test_bootstrap_edge_cases.py            # len 0 / 1 / all-identical samples (§2.4)
  test_zero_success_baseline.py           # warning + CI undefined breakdown (§7.1)
  test_plot_rendering.py                  # all 4 curves, PNG exists + shape (§2.1, §3.5)
  test_plot_graceful_degrade.py           # WandB unavailable skips 2 plots (§3.5, §7.6)
  test_render_probe_report_md.py          # fixed 35-line template (§2.3, §4.5)
  test_probe_report_json_roundtrip.py     # sort_keys serialization (§4.4)
  test_exploit_classes_always_emitted.py  # all 5 in report even when zero (§3.6)
```

**Unit test case inventory — 27 cases total (exceeds the ≥ 25 requirement).**

### 1.1 `run_eval` signature + delegation — `test_run_eval_signature.py`

**Scope:** `run_eval(model_path, episodes=50) -> EvalReport` is a thin wrapper over `training.train.eval`; it must pass `model_path` through verbatim, default `episodes=50`, and return the delegate's `EvalReport` unchanged.

| # | Name | Setup | Assertion |
|---|---|---|---|
| U1 | `test_run_eval_default_episodes_50` | Stub `training.train.eval` to capture kwargs; call `run_eval("base")`. | Delegate invoked with `model_path="base"` and `episodes=50`. |
| U2 | `test_run_eval_accepts_literal_base` | `run_eval("base", 50)`. | No error; returns stubbed `EvalReport` with `model_path == "base"`. |
| U3 | `test_run_eval_accepts_path_object` | `run_eval(Path("/tmp/ckpt"), 50)`. | Delegate invoked with `model_path=Path("/tmp/ckpt")`. |
| U4 | `test_run_eval_propagates_model_load_error` | Stub raises `EvalModelLoadError("adapter missing")`. | `run_eval` re-raises `EvalModelLoadError`; no silent fallback. |

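The delegation checks in U1–U4 can be sketched with `unittest.mock`. This is a minimal illustration, assuming a wrapper that forwards its arguments as keyword arguments; the `delegate` parameter is a hypothetical injection point used here so the snippet runs without the real `training.train.eval`, and `RuntimeError` stands in for `EvalModelLoadError`.

```python
from pathlib import Path
from unittest import mock


def run_eval(model_path, episodes=50, *, delegate):
    """Hypothetical thin wrapper: forwards both arguments verbatim."""
    return delegate(model_path=model_path, episodes=episodes)


# Stand-in for training.train.eval that records how it was called.
delegate = mock.Mock(return_value={"model_path": "base", "n_episodes": 50})

report = run_eval("base", delegate=delegate)
delegate.assert_called_once_with(model_path="base", episodes=50)  # U1

# U3: a Path must reach the delegate unchanged, not coerced to str.
delegate.reset_mock()
run_eval(Path("/tmp/ckpt"), 50, delegate=delegate)
passed = delegate.call_args.kwargs["model_path"]
assert isinstance(passed, Path)

# U4: errors from the delegate propagate; the wrapper has no fallback.
delegate.side_effect = RuntimeError("adapter missing")
try:
    run_eval("base", delegate=delegate)
    propagated = False
except RuntimeError:
    propagated = True
```

In the suite itself the same capture is done by monkeypatching `training.train.eval` through the `stubbed_training_eval` fixture rather than by injecting a parameter.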
### 1.2 Episode selection determinism — `test_episode_selection_deterministic.py`

**Scope:** evaluation.md §3.1 — baseline and final both read `val/briefs.jsonl[0:50]` in file order; probe reads `[50:250]`; env seed = `hash((episode_id, "eval")) & 0xFFFFFFFF`; paired `(episode_id, seed)` tuples must match between baseline and final.

| # | Name | Setup | Assertion |
|---|---|---|---|
| U5 | `test_eval_reads_first_50_rows_in_file_order` | `eval_50_episodes_val_slice` fixture + mock `load_briefs` recording `.take(50)`. Call `run_eval("base", 50)`. | `load_briefs` received exactly the first 50 `BriefRow`s; episode_ids order matches file order. |
| U6 | `test_probe_reads_rows_50_to_250` | Mock `load_briefs` with 500-row fixture. Call `probe_reward_hacking(Path("/ckpt"), 200)`. | The 200 `BriefRow`s passed to `training.eval` are rows `[50:250]` — disjoint from the paired 50, confirmed by episode_id set intersection == ∅. |
| U7 | `test_env_seed_is_hash_tuple_episode_id_eval` | Mock `env.reset` to record seed. | For each episode, recorded `seed == hash((episode_id, "eval")) & 0xFFFFFFFF`. |
| U8 | `test_baseline_and_final_share_same_seeds` | `stubbed_training_eval` records seeds per run. Run baseline then final. | `baseline.breakdown["episode_ids"] == final.breakdown["episode_ids"]`; zipped seeds are pairwise identical. |
| U9 | `test_episode_set_leak_error_raised_on_mismatch` | Baseline fixture with episode_ids `[a,b,c,...]`; final fixture with `[a,b,X,...]`. Call post-run guard. | Raises `EpisodeSetLeakError` with substring `"paired-comparison invariant"`. |

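The seed formula checked in U7 can be reproduced in a few lines. One caveat is worth keeping in mind when writing U7 and U8: CPython salts `hash()` for `str` values per process unless `PYTHONHASHSEED` is fixed, so the value is repeatable within one interpreter run but not necessarily across runs. The `stable_seed` helper below is a hypothetical process-independent alternative, not something evaluation.md specifies.

```python
import hashlib


def env_seed(episode_id: str) -> int:
    """Seed as written in the spec: built-in hash masked to 32 bits."""
    return hash((episode_id, "eval")) & 0xFFFFFFFF


def stable_seed(episode_id: str) -> int:
    """Hypothetical process-independent variant (NOT from the spec)."""
    digest = hashlib.sha256(f"{episode_id}|eval".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


ids = ["s3_ep_00000131", "s3_ep_00000132"]
seeds = [env_seed(i) for i in ids]

# Within one process the spec formula is repeatable and fits in 32 bits.
assert seeds == [env_seed(i) for i in ids]
assert all(0 <= s <= 0xFFFFFFFF for s in seeds)
```

If the baseline and final runs execute in separate processes, the pairwise-identical seeds asserted in U8 hold only when both runs share the same `PYTHONHASHSEED` or the seed is persisted alongside the report.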
### 1.3 Sampling policy frozen — `test_sampling_policy_frozen.py`

**Scope:** evaluation.md §3.2 — greedy `temperature=0.0`, `top_k=1`, `num_generations=1`, `model.eval()`, `torch.no_grad()`, all dropouts OFF. `run_eval` re-asserts these at entry.

| # | Name | Setup | Assertion |
|---|---|---|---|
| U10 | `test_run_eval_enforces_temperature_zero` | Stub `training.eval` with a capture-then-assert that records sampling kwargs. | Captured `temperature == 0.0`; `top_k == 1`; `num_generations == 1`. |
| U11 | `test_run_eval_wraps_in_no_grad_and_eval_mode` | Mock torch context; assert `model.eval()` called and `torch.no_grad()` context entered before first forward. | `model.eval` call_count ≥ 1; `no_grad.__enter__` called before first forward. |
| U12 | `test_run_eval_dropouts_off` | Stub model with `.train()` recorded; assert `.eval()` wins and dropout modules report `.training is False`. | All `nn.Dropout` / LoRA-dropout modules have `training is False` at sample time. |

### 1.4 Aggregation + bootstrap CI — `test_aggregation_bootstrap_ci.py`

**Scope:** evaluation.md §3.3 — `bootstrap_ci(samples, n_boot=10_000, alpha=0.05, rng_seed=20260426)` called once per reward channel; results stored on `EvalReport.r{i}_mean_ci` tuple.

| # | Name | Setup | Assertion |
|---|---|---|---|
| U13 | `test_bootstrap_ci_default_n_boot_10000` | Call `bootstrap_ci(tuple(range(50)), n_boot=10_000, rng_seed=20260426)`. | Returned `(mean, lo, hi)` triple; `lo < mean < hi`; mean within `abs_tol=1e-9` of arithmetic mean. |
| U14 | `test_bootstrap_ci_deterministic_with_seed` | Call twice with identical args. | Byte-identical `(mean, lo, hi)` on both calls (re-run determinism). |
| U15 | `test_paired_difference_ci_uses_seed_20260428` | `paired_difference_ci(baseline, final)` with mocked rng capture. | `numpy.random.default_rng` called with `20260428`; output tuple reproducible. |

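A percentile bootstrap consistent with the signature above might look like the following. This is a sketch, not the implementation behind evaluation.md: the vectorised resampling and the use of `numpy.percentile` are assumptions, and the short-sample branches simply follow the §2.4 rules exercised in §1.5.

```python
import math

import numpy as np


def bootstrap_ci(samples, n_boot=10_000, alpha=0.05, rng_seed=20260426):
    """Return (mean, lo, hi) using a percentile bootstrap of the mean."""
    n = len(samples)
    if n == 0:
        return (math.nan, math.nan, math.nan)
    data = np.asarray(samples, dtype=float)
    if n == 1 or np.all(data == data[0]):
        v = float(data[0])
        return (v, v, v)  # no variance to resample
    rng = np.random.default_rng(rng_seed)
    # Draw n_boot resamples of size n (with replacement) in one call.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = data[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return (float(data.mean()), float(lo), float(hi))


first = bootstrap_ci(tuple(range(50)))
second = bootstrap_ci(tuple(range(50)))
assert first == second                 # same seed, same triple (U14)
assert first[1] < first[0] < first[2]  # lo < mean < hi (U13)
```

Re-creating the generator from `rng_seed` on every call, rather than sharing one generator across calls, is what makes the U14 re-run check hold regardless of call order.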
### 1.5 Bootstrap edge cases — `test_bootstrap_edge_cases.py`

**Scope:** evaluation.md §2.4 — len 0 → `(nan, nan, nan)`; len 1 → `(v, v, v)`; all-identical → `(v, v, v)` no variance.

| # | Name | Setup | Assertion |
|---|---|---|---|
| U16 | `test_bootstrap_ci_len0_returns_all_nan` | `bootstrap_ci(tuple(), n_boot=10_000)`. | All three outputs `math.isnan(...)`. |
| U17 | `test_bootstrap_ci_len1_returns_triple_v` | `bootstrap_ci((0.42,), n_boot=10_000)`. | Returns `(0.42, 0.42, 0.42)` exactly. |

### 1.6 Per-language cohort rendering — `test_per_language_cohort.py`

**Scope:** evaluation.md §3.4 — cohorts with `n_episodes >= 5` render numeric mean + 95% CI; `1 <= n <= 4` renders striped low-n bar with label `(low-n)`; `n == 0` renders empty slot labelled `(no episodes)`. No CI computed for low-n or empty cohorts.

| # | Name | Setup | Assertion |
|---|---|---|---|
| U18 | `test_cohort_ge_5_renders_numeric_with_ci` | Fixture cohort of 11 episodes. | Rendered bar payload has `n_episodes == 11`, numeric `mean`, `ci is not None`. |
| U19 | `test_cohort_low_n_1_to_4_renders_striped` | Fixture cohort of 3 episodes. | Rendered bar carries `style == "striped"`, `label.endswith("(low-n)")`, `ci is None`. |
| U20 | `test_cohort_empty_renders_empty_slot` | Fixture cohort of 0 episodes. | Rendered entry has `n_episodes == 0`, `label.endswith("(no episodes)")`, `math.isnan(mean)`. |

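The three rendering tiers can be expressed as one small function. The sketch below is illustrative only: the payload keys mirror the assertions in U18–U20, while `render_cohort`, the `ci_fn` hook, and the `"solid"` / `"empty"` style names are hypothetical. Note that the empty-cohort mean has to be checked with `math.isnan`, since an identity or equality comparison against a fresh `float("nan")` never holds.

```python
import math
from statistics import mean

LOW_N_THRESHOLD = 5  # cohorts below this size get no CI (evaluation.md §3.4)


def render_cohort(language, rewards, ci_fn):
    """Build a bar payload for one language cohort.

    `ci_fn` is a hypothetical hook returning a CI; it is only consulted
    for cohorts with at least LOW_N_THRESHOLD episodes.
    """
    n = len(rewards)
    if n == 0:
        return {"n_episodes": 0, "label": f"{language} (no episodes)",
                "style": "empty", "mean": math.nan, "ci": None}
    if n < LOW_N_THRESHOLD:
        return {"n_episodes": n, "label": f"{language} (low-n)",
                "style": "striped", "mean": mean(rewards), "ci": None}
    return {"n_episodes": n, "label": language,
            "style": "solid", "mean": mean(rewards), "ci": ci_fn(rewards)}


placeholder_ci = lambda xs: (min(xs), max(xs))  # stand-in, not a real CI
empty = render_cohort("kn", [], placeholder_ci)
low = render_cohort("ta", [0.2, 0.4, 0.6], placeholder_ci)
full = render_cohort("hi", [0.1] * 11, placeholder_ci)
assert math.isnan(empty["mean"])
```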
### 1.7 Probe scanner + novel class — `test_probe_scanner_mechanics.py`, `test_probe_novel_class.py`

**Scope:** evaluation.md §3.6 — all 5 exploit classes always enumerated in the report (count ≥ 0); any `offense.code ∉ EXPLOIT_CLASSES` surfaces as a Novel exploit (threshold = 1 occurrence).

| # | Name | Setup | Assertion |
|---|---|---|---|
| U21 | `test_probe_emits_all_5_known_classes_even_when_zero` | Fixture rewards with zero anti-hack offenses. Call `probe_reward_hacking`. | `len(report.per_class) >= 5`; every member of `EXPLOIT_CLASSES` present; each has `count == 0`, `rate == 0.0`, `example_episode_id is None`. |
| U22 | `test_probe_novel_class_surfaced_on_single_instance` | Fixture: one episode with `offense.code == "zero_width_evasion"`. | `report.novel_classes == ("zero_width_evasion",)`; summary row for it has `count == 1`; markdown contains substring `"UNKNOWN EXPLOIT CLASS"`. |

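The "always emit the known classes, surface anything else as novel" rule can be sketched with a `Counter`. This is a hedged illustration: the input shape, the `summarize` helper, and the dictionary rows are assumptions, and only the first two class names below appear in this plan; the other three are placeholders.

```python
from collections import Counter

# Only the first two names appear in this plan; the remaining three are
# placeholders standing in for the other known classes.
EXPLOIT_CLASSES = ("hallucinated_field", "bare_drift_claim",
                   "known_class_3", "known_class_4", "known_class_5")


def summarize(offenses_by_episode):
    """offenses_by_episode: {episode_id: [offense_code, ...]} (assumed shape)."""
    counts = Counter()
    example = {}
    for episode_id, codes in offenses_by_episode.items():
        for code in codes:
            counts[code] += 1
            example.setdefault(code, episode_id)  # keep the first example seen
    n = len(offenses_by_episode)
    # A single occurrence of an unknown code is enough to mark it novel.
    novel = tuple(sorted(c for c in counts if c not in EXPLOIT_CLASSES))
    per_class = [
        {"exploit_class": c, "count": counts[c],
         "rate": counts[c] / n if n else 0.0,
         "example_episode_id": example.get(c)}
        for c in EXPLOIT_CLASSES + novel  # known classes are always emitted
    ]
    return per_class, novel


episodes = {f"ep_{i:03d}": [] for i in range(200)}
episodes["ep_131"] = ["zero_width_evasion"]
per_class, novel = summarize(episodes)
```

With 200 episodes and one novel offense, the sixth row carries `count == 1` and `rate == 0.005`, matching the shape of `probe_report_with_novel_class` in §5.5.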
### 1.8 Probe guards — `test_probe_on_base_guard.py`, `test_probe_insufficient_samples.py`

**Scope:** evaluation.md §5 — `ProbeOnBaseModelError` when `model_path=="base"`; `ProbeInsufficientSamplesError` when `episodes < 50`.

| # | Name | Setup | Assertion |
|---|---|---|---|
| U23 | `test_probe_on_base_raises` | `probe_reward_hacking("base", 200)`. | Raises `ProbeOnBaseModelError`; no call to `training.eval`. |
| U24 | `test_probe_insufficient_samples_raises` | `probe_reward_hacking(Path("/ckpt"), 49)`. | Raises `ProbeInsufficientSamplesError` with substring `"n < 50"`. |

### 1.9 Catalogue hash + budget — `test_catalogue_hash_mismatch.py`, `test_eval_budget_exceeded.py`

**Scope:** evaluation.md §3.1 (catalogue pinning) and §3.8 (wall-clock ceilings 20 min / 60 min / 2 min, raising `EvalBudgetExceededError`).

| # | Name | Setup | Assertion |
|---|---|---|---|
| U25 | `test_catalogue_hash_mismatch_blocks_eval` | Fixture `BriefRow.catalogue_hash` set to `"stale"`; currently-loaded yaml hashes to `"current"`. Call `run_eval`. | Raises `CatalogueHashMismatchError` before any rollout; no `training.eval` call. |
| U26 | `test_run_eval_budget_20min_exceeded_raises` | Monkeypatch `time.monotonic` to simulate 20 min 1 s elapsed. | Raises `EvalBudgetExceededError` with substring `"run_eval"` and `"20 min"`. |
| U27 | `test_probe_budget_60min_and_plot_budget_2min` | Parametrized across `(probe_reward_hacking, 60*60+1)` and `(render_plots, 120+1)`. | Both raise `EvalBudgetExceededError` naming their respective ceiling. |

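The budget tests never sleep; they move the clock instead. The sketch below shows the idea with an injected clock rather than a monkeypatched `time.monotonic`. The `check_budget` helper, its message format, and the local exception class are assumptions made so the snippet is self-contained.

```python
import time


class EvalBudgetExceededError(RuntimeError):
    """Local stand-in for the project's exception class."""


def check_budget(name, started_at, ceiling_s, now=time.monotonic):
    """Raise once the elapsed wall-clock time passes the ceiling."""
    elapsed = now() - started_at
    if elapsed > ceiling_s:
        minutes = ceiling_s // 60
        raise EvalBudgetExceededError(
            f"{name} exceeded its {minutes} min budget ({elapsed:.0f} s elapsed)"
        )


# Simulate 20 min 1 s elapsed by injecting a fake clock instead of sleeping.
fake_now = lambda: 20 * 60 + 1
try:
    check_budget("run_eval", started_at=0, ceiling_s=20 * 60, now=fake_now)
    message = None
except EvalBudgetExceededError as exc:
    message = str(exc)

# Exactly at the ceiling is still within budget in this sketch.
check_budget("render_plots", started_at=0, ceiling_s=120, now=lambda: 120)
```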
---

## 2. Property tests

Property tests use `hypothesis.strategies` to generate arbitrary-but-bounded inputs and assert the invariants that evaluation.md commits to. Each property test is pinned with the `@hypothesis.seed(20260426)` decorator so failures are reproducible.

**Property inventory — 6 properties total (exceeds the ≥ 5 requirement).**

### 2.1 Eval purity — same (model, 50-ep) produces byte-identical `EvalReport`

```python
@given(checkpoint=st.sampled_from([Path("/ckpt/a"), Path("/ckpt/b")]))
@settings(max_examples=10, deadline=None)
def test_eval_is_pure(checkpoint, stubbed_training_eval):
    r1 = run_eval(checkpoint, episodes=50)
    r2 = run_eval(checkpoint, episodes=50)
    assert serialize(r1) == serialize(r2)
```

**Invariant:** For the same checkpoint and the same 50-row slice, two back-to-back `run_eval` calls produce `EvalReport` records whose canonical JSON (`sort_keys=True, separators=(",", ":")`) byte-compares equal, including every `r{i}_mean_ci` tuple and every WandB-independent `curves` entry. This is evaluation.md §1 "Deterministic on re-run" invariant, bound to the fixed `rng_seed=20260426` in `bootstrap_ci`.

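The canonical form referred to above is plain `json.dumps` with sorted keys and compact separators. A minimal sketch, assuming a dataclass-shaped report; `MiniReport` is a hypothetical two-field stand-in for `EvalReport`, and this `serialize` is illustrative rather than the project's helper.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MiniReport:
    """Hypothetical two-field stand-in for EvalReport."""
    model_path: str
    reward_mean_ci: tuple


def serialize(report) -> bytes:
    # sort_keys + compact separators give one canonical byte string per value.
    return json.dumps(asdict(report), sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


r1 = MiniReport("base", (0.118, 0.086, 0.152))
r2 = MiniReport("base", (0.118, 0.086, 0.152))
assert serialize(r1) == serialize(r2)
canonical = serialize(r1)
```

Tuples serialize as JSON arrays, so a round-trip yields lists; comparisons are therefore safer on the serialized bytes than on the reloaded objects.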
### 2.2 Bootstrap CI convergence — jitter ≤ 0.001 at `n_boot=10_000`

```python
@given(samples=st.lists(st.floats(min_value=0.0, max_value=1.0, allow_nan=False),
                         min_size=50, max_size=50))
@settings(max_examples=20, deadline=None)
def test_bootstrap_ci_converges(samples):
    m1, lo1, hi1 = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426)
    m2, lo2, hi2 = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426 + 1)
    assert abs(lo1 - lo2) <= 0.001
    assert abs(hi1 - hi2) <= 0.001
```

**Invariant:** At `n_boot=10_000` the 2.5th / 97.5th percentile estimates differ by at most 0.001 across distinct bootstrap seeds on the same underlying samples — the Monte-Carlo jitter ceiling referenced by evaluation.md §3.3. This property defines "convergent enough" and guards against anyone lowering `n_boot` silently.

### 2.3 Paired-diff CI equals final − baseline per episode

```python
@given(b=st.lists(st.floats(min_value=0.0, max_value=1.0, allow_nan=False), min_size=50, max_size=50),
       f=st.lists(st.floats(min_value=0.0, max_value=1.0, allow_nan=False), min_size=50, max_size=50))
@settings(max_examples=30, deadline=None)
def test_paired_diff_ci_is_paired(b, f):
    pd_mean, _, _ = paired_difference_ci(tuple(b), tuple(f),
                                          n_boot=10_000, rng_seed=20260428)
    expected = sum(fi - bi for bi, fi in zip(b, f)) / 50
    assert math.isclose(pd_mean, expected, abs_tol=1e-9)
```

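**Invariant:** The paired-difference mean is the arithmetic mean of the per-index differences `final[i] - baseline[i]`. By linearity this point estimate coincides with `mean(final) - mean(baseline)` whenever the two arrays have equal length, so the assertion above pins the point estimate only. What pairing changes is the interval: it must be built by resampling episode indices jointly, not by resampling the two arrays as unlinked samples (evaluation.md §2.4 `paired_difference_ci` docstring + §7 edge case 7).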

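One way to see what pairing changes: the point estimates of a paired and an unpaired bootstrap agree, but the intervals do not. The sketch below contrasts the two under stated assumptions; `paired_ci` and `unpaired_ci` are illustrative names, not the project's `paired_difference_ci`, and the data are synthetic.

```python
import numpy as np


def paired_ci(baseline, final, n_boot=10_000, alpha=0.05, rng_seed=20260428):
    """Percentile CI of mean(final[i] - baseline[i]), resampling index pairs."""
    diffs = np.asarray(final, dtype=float) - np.asarray(baseline, dtype=float)
    rng = np.random.default_rng(rng_seed)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    means = diffs[idx].mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(diffs.mean()), float(lo), float(hi)


def unpaired_ci(baseline, final, n_boot=10_000, alpha=0.05, rng_seed=20260428):
    """The mistake to avoid: resampling each array independently."""
    b = np.asarray(baseline, dtype=float)
    f = np.asarray(final, dtype=float)
    rng = np.random.default_rng(rng_seed)
    bi = rng.integers(0, len(b), size=(n_boot, len(b)))
    fi = rng.integers(0, len(f), size=(n_boot, len(f)))
    means = f[fi].mean(axis=1) - b[bi].mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(f.mean() - b.mean()), float(lo), float(hi)


baseline = np.linspace(0.0, 0.8, 50)
final = baseline + 0.1  # every episode improves by the same amount
p_mean, p_lo, p_hi = paired_ci(baseline, final)
u_mean, u_lo, u_hi = unpaired_ci(baseline, final)
```

Because every per-episode difference is the same here, the paired interval collapses to (numerically) zero width, while the unpaired interval stays wide: it still carries the between-episode spread that pairing is meant to cancel.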
### 2.4 Paired-diff CI requires equal lengths

```python
import pytest
from hypothesis import assume, given, strategies as st

@given(n_b=st.integers(min_value=1, max_value=100),
       n_f=st.integers(min_value=1, max_value=100))
def test_paired_diff_requires_equal_lengths(n_b, n_f):
    # Discard equal-length draws instead of silently passing; the happy path is covered elsewhere.
    assume(n_b != n_f)
    with pytest.raises(EpisodeSetLeakError):
        paired_difference_ci(tuple([0.1] * n_b), tuple([0.2] * n_f))
```

**Invariant:** Unequal lengths raise `EpisodeSetLeakError` — pairing is strictly index-aligned (evaluation.md §2.4).

### 2.5 Bootstrap CI bracketing — `lo ≤ mean ≤ hi`

```python
@given(samples=st.lists(st.floats(min_value=-1.0, max_value=1.0, allow_nan=False),
                         min_size=2, max_size=200))
def test_bootstrap_ci_brackets_mean(samples):
    m, lo, hi = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426)
    assert lo <= m <= hi
```

**Invariant:** The percentile bootstrap CI always brackets the point estimate. Guards against transposed percentile extraction (e.g., accidentally returning 97.5 as `lo`).

### 2.6 Probe class closure — every emitted class is either known or appears in `novel_classes`

```python
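@given(offense_codes=st.lists(st.text(min_size=1, max_size=32), min_size=0, max_size=40))
def test_probe_class_closure(offense_codes, stubbed_training_eval_with_offenses):
    # Assumption: the fixture is factory-style and seeds the stub with the generated codes;
    # without this call the generated `offense_codes` would never reach the probe.
    stubbed_training_eval_with_offenses(offense_codes)
    report = probe_reward_hacking(Path("/ckpt"), episodes=200)
    for row in report.per_class:
        assert row.exploit_class in EXPLOIT_CLASSES or row.exploit_class in report.novel_classes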
```

**Invariant:** Every `ProbeExploitClassSummary.exploit_class` value in `report.per_class` is either one of the 5 known classes or is listed verbatim in `report.novel_classes`. No summary row exists outside this closure — protects the discovery channel.

---

## 3. Integration tests

Integration tests exercise cross-module wiring end-to-end. Real `training.train.eval` is **stubbed** (no CUDA), but the full evaluation pipeline — `run_eval → EvalReport → render_plots` and `probe_reward_hacking → scan → render_probe_report_md` — runs on actual dataclass instances.

File: `tests/test_evaluation/test_integration.py`.

### 3.1 Baseline eval on base model (stubbed)

```
test_integration_baseline_on_base_model
  Setup:   stubbed_training_eval returns baseline_eval_report_fixture.
           eval_50_episodes_val_slice loaded from fixtures/val_briefs_50.jsonl.
  Action:  run_eval("base", 50)
  Assert:  EvalReport.model_path == "base"
           EvalReport.n_episodes == 50
           len(r1_mean_ci) == 3 and lo <= mean <= hi for every r{i}_mean_ci
           breakdown["episode_ids"] == tuple of 50 episode_ids from val[0:50]
           JSON round-trip through canonical serializer is byte-stable.
```

### 3.2 Final eval on trained LoRA (stubbed)

```
test_integration_final_on_trained_lora
  Setup:   stubbed_training_eval returns final_eval_report_fixture.
           baseline.json already present on disk (from §3.1).
  Action:  run_eval(Path("/fake/ckpt/stage3_final"), 50)
           then post-run guard: assert baseline.episode_ids == final.episode_ids
  Assert:  EvalReport.model_path == "/fake/ckpt/stage3_final"
           EvalReport.reward_mean_ci[0] > baseline.reward_mean_ci[0]
           paired_difference_ci stored under breakdown["paired_ci"]
           No EpisodeSetLeakError raised.
```

### 3.3 Probe 200 episodes → markdown report

```
test_integration_probe_200_episodes_produces_markdown
  Setup:   stubbed_training_eval returns 200 Rewards records with:
             - 2 hallucinated_field offenses
             - 1 bare_drift_claim offense
             - 0 of the other 3 classes
  Action:  report = probe_reward_hacking(Path("/fake/ckpt"), 200)
           md_path = render_probe_report_md(report, tmp_path / "probe_report.md")
  Assert:  md_path.exists()
           md_text.count("### ") == 5            # all 5 known classes present
           "Novel exploit classes: none" in md_text
           "**Total offenses:** 3" in md_text
           json_round_trip(report) bytes-stable
           probe_report.json passes schema validation.
```

### 3.4 Plot rendering for all 4 target curves

```
test_integration_render_all_4_plots
  Setup:   baseline_eval_report_fixture, final_eval_report_fixture,
           stubbed WandB run-history returning R{1..5}_mean per step and
           eval/drift_latency_p50+p95 at steps {50,100,150,200,300,400,500}.
  Action:  paths = render_plots(baseline, final, wandb_run_id="stub-run",
                                 out_dir=tmp_path)
  Assert:  set(paths.keys()) == {
               "per_reward_stack", "drift_latency_vs_step",
               "per_language_bars",  "before_after_bars"
           }
           every path .exists() and .stat().st_size > 1024 bytes
           PIL.Image.open(path).size == (1600, 900)   # canonical figsize
           pytest-mpl snapshot compare passes with rms < 0.5 per plot.
```

### 3.5 WandB unavailable — graceful degrade (2 plots)

```
test_integration_render_plots_without_wandb
  Setup:   render_plots(..., wandb_run_id=None)
  Assert:  set(paths.keys()) == {"per_language_bars", "before_after_bars"}
           WandBHistoryUnavailableWarning emitted (captured via pytest.warns)
           no PlotRenderError raised
           returned dict omits the two history-driven plots.
```

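The degrade path amounts to "warn, then return the subset". A minimal sketch of that control flow, assuming the two history-driven plots are `per_reward_stack` and `drift_latency_vs_step`; `plan_plots` is a hypothetical helper that only decides which keys to render, and the warning class is a local stand-in.

```python
import warnings


class WandBHistoryUnavailableWarning(UserWarning):
    """Local stand-in for the project's warning class."""


ALWAYS = ("per_language_bars", "before_after_bars")
NEEDS_HISTORY = ("per_reward_stack", "drift_latency_vs_step")


def plan_plots(wandb_run_id):
    """Return the plot keys that can be rendered for this run (sketch only)."""
    if wandb_run_id is None:
        warnings.warn("WandB history unavailable; skipping history-driven plots",
                      WandBHistoryUnavailableWarning)
        return set(ALWAYS)
    return set(ALWAYS) | set(NEEDS_HISTORY)


# Record warnings instead of printing them, as pytest.warns would.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    degraded = plan_plots(None)

full = plan_plots("stub-run")
```

Under pytest the same check is usually written with `pytest.warns(WandBHistoryUnavailableWarning)` around the call.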
### 3.6 GPU delegation path (skipped on CPU-only CI)

```
@pytest.mark.cuda
test_integration_real_training_eval_delegation
  Setup:   real Gemma 3n E2B + toy LoRA adapter on a 2-episode smoke slice.
  Assert:  run_eval returns a valid EvalReport; no exception.
```

Skipped automatically when `torch.cuda.is_available() is False`.

---

## 4. Coverage target

**Line coverage:** 100% on each of:
- `training/eval_baseline.py`
- `training/eval_final.py`
- `training/probe.py` (aliased for `training/probe_reward_hacking.py`)
- `training/plots.py`

**Branch coverage:** ≥ 95% on the same files. Exclusions (via `# pragma: no cover` with justification comment):

1. `if TYPE_CHECKING:` import blocks.
2. Real-CUDA-only branches inside `training.eval` delegation that the `stubbed_training_eval` fixture bypasses (marked with `# pragma: no cover-stubbed`).
3. `matplotlib` backend-selection dead code paths (`backend == "Agg"` already forced at module import).

**Verification command:**
```
pytest tests/test_evaluation/ \
  --cov=training.eval_baseline \
  --cov=training.eval_final \
  --cov=training.probe \
  --cov=training.plots \
  --cov-branch \
  --cov-fail-under=100 \
  --cov-report=term-missing
```

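Branch data is collected via `--cov-branch`. Note that coverage.py exposes a single `fail_under` threshold, applied to the combined line-and-branch total once branch measurement is on, and has no separate `fail_under_branch` setting. The 95% branch floor therefore has to be enforced by a separate check over the per-file branch totals (for example, those emitted by `coverage json`).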

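The per-file branch floor can be checked directly from the JSON report. A minimal sketch, assuming the layout written by `coverage json` with branch measurement enabled (`files -> <name> -> summary -> num_branches / covered_branches`); `branch_failures` is a hypothetical helper, the sample numbers are made up, and the field names should be verified against the installed coverage.py version.

```python
import json


def branch_failures(coverage_json: dict, floor: float = 95.0) -> dict:
    """Return {filename: branch_pct} for files under the floor."""
    failures = {}
    for name, entry in coverage_json.get("files", {}).items():
        summary = entry["summary"]
        total = summary.get("num_branches", 0)
        if total == 0:
            continue  # nothing to measure in this file
        pct = 100.0 * summary.get("covered_branches", 0) / total
        if pct < floor:
            failures[name] = round(pct, 2)
    return failures


# Inline sample instead of reading coverage.json from disk.
sample = json.loads("""
{"files": {
  "training/plots.py": {"summary": {"num_branches": 40, "covered_branches": 39}},
  "training/probe.py": {"summary": {"num_branches": 20, "covered_branches": 18}}
}}
""")
failing = branch_failures(sample)
```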
**Per-file test → line mapping (authoritative):**

| File | Covering test file(s) | Targeted lines |
|---|---|---|
| `training/eval_baseline.py` | `test_run_eval_signature.py`, `test_episode_selection_deterministic.py`, `test_sampling_policy_frozen.py`, `test_zero_success_baseline.py`, `test_eval_budget_exceeded.py` | CLI argparse, `run_eval("base", …)` call, baseline.json write, ZeroSuccessBaselineWarning path. |
| `training/eval_final.py` | `test_run_eval_signature.py`, `test_episode_set_leak_error.py`, `test_aggregation_bootstrap_ci.py`, `test_plot_rendering.py`, `test_plot_graceful_degrade.py` | CLI, `run_eval(ckpt, …)`, paired-diff CI store, `render_plots` call, EpisodeSetLeakError guard at exit. |
| `training/probe.py` | `test_probe_scanner_mechanics.py`, `test_probe_novel_class.py`, `test_probe_on_base_guard.py`, `test_probe_insufficient_samples.py`, `test_render_probe_report_md.py`, `test_probe_report_json_roundtrip.py`, `test_exploit_classes_always_emitted.py` | `probe_reward_hacking`, `scan_episode_for_exploits`, `render_probe_report_md`, JSON serializer, all 5 guard branches. |
| `training/plots.py` | `test_plot_rendering.py`, `test_plot_graceful_degrade.py` | All 4 plot functions, WandB history fetcher, graceful-degrade branching, `PlotRenderError` path. |

---

## 5. Fixtures

All fixtures live in `tests/conftest.py` (repo-root shared scope) so that each fixture name resolves to identical content across `training_tests.md`, `pitch_demo_tests.md`, `risk_book_tests.md`, and this plan. Fixture IDs are the function names registered via `@pytest.fixture`.

### 5.1 `eval_50_episodes_val_slice`

**Shape:** `tuple[BriefRow, ...]` of length 50. Sourced from `tests/fixtures/val_briefs_50.jsonl` — the first 50 rows of a publication seed of `val/briefs.jsonl` (datasets.md §4.7), committed verbatim (≤ 120 KiB). Each row carries `episode_id`, `seed`, `catalogue_hash`, `templates_sha256`, `i18n_sha256`, `goal.language ∈ {hi, ta, kn, en, hinglish}`, and the embedded `GoalSpec`.

**Shared with:** `training_tests.md` only, imported under the same canonical name `eval_50_episodes_val_slice` for `EpisodeDatasetAdapter` iteration in integration test 3.4 (see the `training_tests.md §5.6` footer cross-reference). Per the sharing table in §5.6, `pitch_demo_tests.md` and `risk_book_tests.md` do not consume this fixture. **This plan is the sole definition site**; all consumers import from `tests/conftest.py`.

### 5.2 `baseline_eval_report_fixture`

**Shape:** A hand-built `EvalReport` mirroring the numbers in evaluation.md §8.1 verbatim:
- `model_path = "base"`, `n_episodes = 50`
- `reward_mean_ci = (0.118, 0.086, 0.152)`, `r1_mean_ci = (0.100, 0.040, 0.180)`, etc.
- 5 `PerLanguageReport`s (hi n=11, ta n=10, kn n=9, en n=10, hinglish n=10)
- `drift_detection_latency` with stage2/stage3 all NaN, `undetected_count=27`
- `breakdown["episode_ids"] = tuple_of_50_ids` deterministic
- `reward_hacking_offenses = {"hallucinated_field": 7, ...}` per §8.1

**Used by:** integration §3.1, §3.4, §3.5. **Eval-only** — not shared with `training_tests.md` (which constructs `EvalReport`s inline via stubbed `training.eval`).

### 5.3 `final_eval_report_fixture`

**Shape:** Hand-built `EvalReport` matching evaluation.md §8.2:
- `model_path = "/abs/path/checkpoints/stage3_final"`, `n_episodes = 50`
- `reward_mean_ci = (0.542, 0.480, 0.604)`, etc.
- `drift_detection_latency` with stage2_mean=1.2, stage3_mean=1.6, undetected_count=9
- `curves` dict with the 4 keys enumerated in §8.2
- `breakdown["paired_ci"]` populated with ΔR1, ΔR2, Δreward_mean, Δdrift_latency triples
- Shares `episode_ids` tuple with §5.2 (paired-comparison invariant)

**Used by:** integration §3.2, §3.4, §3.5. **Eval-only** — not shared with `training_tests.md`.

### 5.4 `probe_report_no_exploits`

**Shape:** `ProbeReport` with `n_episodes=200`, `per_class` containing all 5 known classes at `count=0`, `rate=0.0`, `example_episode_id=None`, `total_hits=0`, `novel_classes=()`. Generated from a stubbed `training.eval` that returns 200 `Rewards` records with empty `anti_hack.offenses` lists.

**Used by:** integration §3.3 (happy-path markdown rendering), `pitch_demo_tests.md` (probe-artefact badge I12). Not consumed by `risk_book_tests.md` (risk-book's domain is `Risk.triage` / risk register, not exploit reports).

### 5.5 `probe_report_with_novel_class`

**Shape:** `ProbeReport` with `n_episodes=200`, `per_class` containing all 5 known classes (zeros) **plus** a sixth `ProbeExploitClassSummary` for `exploit_class="zero_width_evasion"` with `count=1`, `rate=0.005`, `example_episode_id="s3_ep_00000131"`. `novel_classes=("zero_width_evasion",)`. Generated from a stubbed `training.eval` that seeds a single `offense.code="zero_width_evasion"` into one episode's `Rewards.breakdown.anti_hack.offenses`.

**Used by:** novel-class unit test (§1.7 U22). Eval-only — not consumed by other plans.

### 5.6 Cross-plan sharing contract

This plan (`evaluation_tests.md`) is the **sole definition site** for all 5 fixtures below. Consumers import from `tests/conftest.py`; they MUST NOT redefine. `✅` = consumed (imported from conftest and actually referenced in the consuming plan). `—` = not consumed by that plan.

| Fixture | Defined in | evaluation_tests.md | training_tests.md | pitch_demo_tests.md | risk_book_tests.md |
|---|---|---|---|---|---|
| `eval_50_episodes_val_slice` | evaluation_tests.md §5.1 | ✅ consumed (definer) | ✅ consumed (integration §3.4 paired eval; see training_tests.md §5.6 footer cross-reference) | — (no eval slice needed by pitch demo tests) | — (risk-register domain, no eval slice needed) |
| `baseline_eval_report_fixture` | evaluation_tests.md §5.2 | ✅ consumed (definer) | — (eval-only; training stubs `training.eval` directly) | — | — |
| `final_eval_report_fixture` | evaluation_tests.md §5.3 | ✅ consumed (definer) | — (eval-only) | — | — |
| `probe_report_no_exploits` | evaluation_tests.md §5.4 | ✅ consumed (definer) | — (probe entry-point tested via direct mocks in U45) | ✅ consumed (I12 blog badge) | — (risk register is about `Risk.triage`, not exploit reports) |
| `probe_report_with_novel_class` | evaluation_tests.md §5.5 | ✅ consumed (definer) | — | — | — |

**Truth-verification rule:** every `✅ consumed` above is bidirectionally consistent: the consuming plan's own §5 both (a) does not re-define the fixture and (b) explicitly cross-references this section as the definition site. The only cross-plan consumer rows are `training_tests.md` for `eval_50_episodes_val_slice` (see `training_tests.md §5.6`) and `pitch_demo_tests.md` for `probe_report_no_exploits` (see `pitch_demo_tests.md §5.5`). All other cells are `—`.

If a downstream plan needs to mutate any of these fixtures, it must define a derived fixture (e.g., `@pytest.fixture def probe_report_no_exploits_for_demo(probe_report_no_exploits): ...`) rather than editing the shared body. Enforced by a `tests/conftest_lock.py` sha256 check over the 5 fixture sources.

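A lock of this kind reduces to hashing the fixture sources in a stable order and comparing against a stored digest. The sketch below shows one possible shape; `fixture_digest` and the literal source strings are hypothetical, and the actual contents of `tests/conftest_lock.py` are not specified in this plan.

```python
import hashlib


def fixture_digest(sources: dict) -> str:
    """Hash fixture sources in name order so the digest is order-independent."""
    h = hashlib.sha256()
    for name in sorted(sources):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator so name/body boundaries stay unambiguous
        h.update(sources[name].encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


# In a real lock file the source text would come from the conftest module
# (for instance via inspect.getsource); literals keep this sketch standalone.
sources = {
    "probe_report_no_exploits": "def probe_report_no_exploits(): ...",
    "eval_50_episodes_val_slice": "def eval_50_episodes_val_slice(): ...",
}
locked = fixture_digest(sources)

edited = dict(sources, probe_report_no_exploits="def probe_report_no_exploits(): return 1")
assert fixture_digest(edited) != locked  # any body edit changes the digest
```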
---

**End of evaluation_tests.md.** This plan is sealed pending ≥ 2 fresh critic `NOTHING_FURTHER` returns per `DRIFTCALL/CLAUDE.md` §3.4.