evaluation_tests.md: Test Plan for docs/modules/evaluation.md
Target modules: training/eval_baseline.py, training/eval_final.py, training/probe_reward_hacking.py (aliased training/probe.py in test imports), training/plots.py
Spec doc: DRIFTCALL/docs/modules/evaluation.md (final sealed, 2026-04-26)
Cross-refs: DRIFTCALL/docs/modules/training.md §2.1 (eval() contract), §4.2 (EvalReport); DRIFTCALL/docs/modules/rewards.md §3.1 (purity), §3.6 (exploit classes), §4.2 (Rewards.breakdown); DRIFTCALL/docs/modules/datasets.md §4.7 (val split), §5 (catalogue hashes); DRIFTCALL/CLAUDE.md §3.1 (nine-section test-plan doc).
Framework: pytest + hypothesis + unittest.mock + pytest-mpl (plot image-compare, tolerance-based).
Owner: Person B (Rewards & Tests), co-signed by Person C (Training) for the training.eval delegation path.
CUDA policy: Model inference is mocked by default. Every test that would touch a real LoRA adapter or base weight goes through the stubbed_training_eval fixture (§5.3), which monkeypatches training.train.eval to return a hand-crafted EvalReport. Zero CUDA calls in CI. A single @pytest.mark.cuda integration test exercises the real delegation path and is skipped when torch.cuda.is_available() is False (the default laptop / CI environment).
Deterministic RNG: numpy.random.default_rng(20260426) for baseline + final bootstrap, numpy.random.default_rng(20260427) for the probe CI, numpy.random.default_rng(20260428) for paired-difference bootstrap. Seeds are frozen in evaluation.md §2.4 and re-asserted at every call site.
Numeric tolerance: math.isclose(a, b, abs_tol=1e-9, rel_tol=0.0) for scalar floats; numpy.testing.assert_allclose(..., atol=1e-6, rtol=0.0) for sample arrays; byte-exact for serialized JSON (sort_keys=True, separators=(",", ":")); image diff tolerance rms < 0.5 pixel for plot snapshots.
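The byte-exact JSON rule and the scalar tolerance above can be sketched with the standard library alone. The helper names canonical_json and scalars_close are hypothetical (the plan's own serializer is only referred to as serialize), so treat this as an illustration of the two comparison modes rather than the project's code.

```python
import json
import math


def canonical_json(payload: dict) -> bytes:
    """Serialize with sorted keys and compact separators.

    Equal payloads then produce equal bytes regardless of key insertion order.
    """
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def scalars_close(a: float, b: float) -> bool:
    """Absolute-only comparison, matching abs_tol=1e-9 and rel_tol=0.0 above."""
    return math.isclose(a, b, abs_tol=1e-9, rel_tol=0.0)


# Two dicts with the same content but different key order (illustrative values).
report_a = {"n_episodes": 50, "model_path": "base", "reward_mean_ci": [0.118, 0.086, 0.152]}
report_b = {"reward_mean_ci": [0.118, 0.086, 0.152], "model_path": "base", "n_episodes": 50}

same_bytes = canonical_json(report_a) == canonical_json(report_b)
```

One thing to keep in mind when reports carry NaN latencies: json.dumps writes them as the bare token NaN by default, which is stable byte-for-byte but is not strict JSON.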
This plan specifies 100% line coverage and ≥ 95% branch coverage on training/eval_baseline.py, training/eval_final.py, training/probe.py, and training/plots.py. Every function signature in evaluation.md §2, every behavior clause in §3.1–§3.8, every error mode in §5, every data-structure invariant in §4, and every edge case in §7 has at least one dedicated test.
Fixtures contract (§5). This plan is the source of truth for all eval-related fixtures. Bidirectionally consistent sharing map (see §5.6 for the authoritative truth-verified table):
- eval_50_episodes_val_slice: defined here (§5.1). Imported by training_tests.md §3.4 only (via cross-reference at training_tests.md §5.6).
- baseline_eval_report_fixture, final_eval_report_fixture: defined here (§5.2, §5.3). Eval-only; not consumed by any other plan.
- probe_report_no_exploits: defined here (§5.4). Imported by pitch_demo_tests.md §5.5 (blog badge I12) only.
- probe_report_with_novel_class: defined here (§5.5). Eval-only.
Any change to fixture bodies MUST be mirrored in tests/conftest.py in lockstep. PR reviewers for any consuming plan must verify content parity against this plan's §5.
1. Unit tests
Organisation: one pytest module per behavior cluster. Zero real inference: every test either uses stubbed_training_eval or asserts against pure-Python helpers.
File layout under tests/test_evaluation/:
tests/test_evaluation/
    __init__.py
    test_run_eval_signature.py               # run_eval signature + delegation (§2.1, §3.2)
    test_episode_selection_deterministic.py  # val[0:50], seed hashing, leak guard (§3.1)
    test_sampling_policy_frozen.py           # T=0, num_gen=1, eval(), no_grad (§3.2)
    test_aggregation_bootstrap_ci.py         # per-reward CI store (§3.3)
    test_per_language_cohort.py              # cohort means + low-n rendering (§3.4)
    test_drift_detection_latency.py          # §3.5 latency compute + Stage-1 NaN
    test_probe_scanner_mechanics.py          # §3.6 scan + Counter aggregation
    test_probe_novel_class.py                # §3.6 unknown offense code path
    test_probe_on_base_guard.py              # ProbeOnBaseModelError (§5)
    test_probe_insufficient_samples.py       # ProbeInsufficientSamplesError (§5)
    test_episode_set_leak_error.py           # EpisodeSetLeakError on mismatch (§5)
    test_catalogue_hash_mismatch.py          # CatalogueHashMismatchError (§5)
    test_eval_budget_exceeded.py             # EvalBudgetExceededError all three ceilings (§3.8, §5)
    test_bootstrap_edge_cases.py             # len 0 / 1 / all-identical samples (§2.4)
    test_zero_success_baseline.py            # warning + CI undefined breakdown (§7.1)
    test_plot_rendering.py                   # all 4 curves, PNG exists + shape (§2.1, §3.5)
    test_plot_graceful_degrade.py            # WandB unavailable skips 2 plots (§3.5, §7.6)
    test_render_probe_report_md.py           # fixed 35-line template (§2.3, §4.5)
    test_probe_report_json_roundtrip.py      # sort_keys serialization (§4.4)
    test_exploit_classes_always_emitted.py   # all 5 in report even when zero (§3.6)
Unit test case inventory: 27 cases total (exceeds the ≥ 25 requirement).
1.1 run_eval signature + delegation (test_run_eval_signature.py)
Scope: run_eval(model_path, episodes=50) -> EvalReport is a thin wrapper over training.train.eval; it must pass model_path through verbatim, default episodes=50, and return the delegate's EvalReport unchanged.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U1 | test_run_eval_default_episodes_50 | Stub training.train.eval to capture kwargs; call run_eval("base"). | Delegate invoked with model_path="base" and episodes=50. |
| U2 | test_run_eval_accepts_literal_base | run_eval("base", 50). | No error; returns stubbed EvalReport with model_path == "base". |
| U3 | test_run_eval_accepts_path_object | run_eval(Path("/tmp/ckpt"), 50). | Delegate invoked with model_path=Path("/tmp/ckpt"). |
| U4 | test_run_eval_propagates_model_load_error | Stub raises EvalModelLoadError("adapter missing"). | run_eval re-raises EvalModelLoadError; no silent fallback. |
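The delegation checks in U1 and U3 could look roughly like the sketch below. It uses a SimpleNamespace as a stand-in for training.train and a hypothetical one-line run_eval, because the real modules are not reproduced in this plan; only the mock-and-assert pattern is the point.

```python
from pathlib import Path
from types import SimpleNamespace
from unittest import mock

# Stand-in for the training.train module (assumption: the real delegate is
# training.train.eval and accepts model_path and episodes keyword arguments).
fake_train = SimpleNamespace(eval=lambda model_path, episodes: None)


def run_eval(model_path, episodes=50):
    """Hypothetical thin wrapper: forwards both arguments verbatim."""
    return fake_train.eval(model_path=model_path, episodes=episodes)


def delegate_receives(model_path, *args):
    """Call run_eval with the delegate mocked; return (call kwargs, passthrough ok)."""
    sentinel = object()
    with mock.patch.object(fake_train, "eval", return_value=sentinel) as delegate:
        result = run_eval(model_path, *args)
    delegate.assert_called_once()
    return delegate.call_args.kwargs, result is sentinel


u1_kwargs, u1_passthrough = delegate_receives("base")
u3_kwargs, u3_passthrough = delegate_receives(Path("/tmp/ckpt"), 50)
```

In the real suite the same assertions would sit behind the stubbed_training_eval fixture rather than a module-level stand-in.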
1.2 Episode selection determinism (test_episode_selection_deterministic.py)
Scope (evaluation.md §3.1): baseline and final both read val/briefs.jsonl[0:50] in file order; probe reads [50:250]; env seed = hash((episode_id, "eval")) & 0xFFFFFFFF; paired (episode_id, seed) tuples must match between baseline and final.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U5 | test_eval_reads_first_50_rows_in_file_order | eval_50_episodes_val_slice fixture + mock load_briefs recording .take(50). Call run_eval("base", 50). | load_briefs received exactly the first 50 BriefRows; episode_ids order matches file order. |
| U6 | test_probe_reads_rows_50_to_250 | Mock load_briefs with 500-row fixture. Call probe_reward_hacking(Path("/ckpt"), 200). | The 200 BriefRows passed to training.eval are rows [50:250], disjoint from the paired 50, confirmed by episode_id set intersection == ∅. |
| U7 | test_env_seed_is_hash_tuple_episode_id_eval | Mock env.reset to record seed. | For each episode, recorded seed == hash((episode_id, "eval")) & 0xFFFFFFFF. |
| U8 | test_baseline_and_final_share_same_seeds | stubbed_training_eval records seeds per run. Run baseline then final. | baseline.breakdown["episode_ids"] == final.breakdown["episode_ids"]; zipped seeds are pairwise identical. |
| U9 | test_episode_set_leak_error_raised_on_mismatch | Baseline fixture with episode_ids [a,b,c,...]; final fixture with [a,b,X,...]. Call post-run guard. | Raises EpisodeSetLeakError with substring "paired-comparison invariant". |
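A caveat worth checking against evaluation.md §3.1 before relying on U7 and U8 across processes: CPython salts the built-in hash() of str values per interpreter process unless PYTHONHASHSEED is pinned, so hash((episode_id, "eval")) is reproducible inside one process but not necessarily between a baseline run and a later final run. The sketch below shows one process-independent alternative based on hashlib. It is a swapped-in technique, not the formula the spec freezes, and stable_eval_seed is a hypothetical name.

```python
import hashlib


def stable_eval_seed(episode_id: str, purpose: str = "eval") -> int:
    """Derive a 32-bit seed that is identical in every process.

    Hypothetical alternative to hash((episode_id, "eval")) & 0xFFFFFFFF:
    SHA-256 is not salted, so the value does not depend on PYTHONHASHSEED.
    """
    digest = hashlib.sha256(f"{episode_id}|{purpose}".encode("utf-8")).digest()
    # The first 4 bytes give the same value range as masking with 0xFFFFFFFF.
    return int.from_bytes(digest[:4], "big")


seed_first = stable_eval_seed("s3_ep_00000131")
seed_again = stable_eval_seed("s3_ep_00000131")
```

If the spec's hash() formula has to stay, pinning PYTHONHASHSEED in the CI and CLI environments would be the other way to get the same guarantee.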
1.3 Sampling policy frozen (test_sampling_policy_frozen.py)
Scope (evaluation.md §3.2): greedy temperature=0.0, top_k=1, num_generations=1, model.eval(), torch.no_grad(), all dropouts OFF. run_eval re-asserts these at entry.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U10 | test_run_eval_enforces_temperature_zero | Stub training.eval with a capture-then-assert that records sampling kwargs. | Captured temperature == 0.0; top_k == 1; num_generations == 1. |
| U11 | test_run_eval_wraps_in_no_grad_and_eval_mode | Mock torch context; assert model.eval() called and torch.no_grad() context entered before first forward. | model.eval call_count ≥ 1; no_grad.__enter__ called before first forward. |
| U12 | test_run_eval_dropouts_off | Stub model with .train() recorded; assert .eval() wins and dropout modules report .training is False. | All nn.Dropout / LoRA-dropout modules have training is False at sample time. |
1.4 Aggregation + bootstrap CI (test_aggregation_bootstrap_ci.py)
Scope (evaluation.md §3.3): bootstrap_ci(samples, n_boot=10_000, alpha=0.05, rng_seed=20260426) called once per reward channel; results stored on EvalReport.r{i}_mean_ci tuple.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U13 | test_bootstrap_ci_default_n_boot_10000 | Call bootstrap_ci(tuple(range(50)), n_boot=10_000, rng_seed=20260426). | Returned (mean, lo, hi) triple; lo < mean < hi; mean within abs_tol=1e-9 of arithmetic mean. |
| U14 | test_bootstrap_ci_deterministic_with_seed | Call twice with identical args. | Byte-identical (mean, lo, hi) on both calls (re-run determinism). |
| U15 | test_paired_difference_ci_uses_seed_20260428 | paired_difference_ci(baseline, final) with mocked rng capture. | numpy.random.default_rng called with 20260428; output tuple reproducible. |
1.5 Bootstrap edge cases (test_bootstrap_edge_cases.py)
Scope (evaluation.md §2.4): len 0 → (nan, nan, nan); len 1 → (v, v, v); all-identical → (v, v, v), no variance.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U16 | test_bootstrap_ci_len0_returns_all_nan | bootstrap_ci(tuple(), n_boot=10_000). | All three outputs math.isnan(...). |
| U17 | test_bootstrap_ci_len1_returns_triple_v | bootstrap_ci((0.42,), n_boot=10_000). | Returns (0.42, 0.42, 0.42) exactly. |
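To make the three edge cases concrete, here is a minimal percentile-bootstrap sketch. It deliberately swaps numpy.random.default_rng for the standard library's random.Random so that it runs without dependencies, and it uses a nearest-rank percentile rather than an interpolating one, so its bounds will not match the real bootstrap_ci digit for digit. Only the signature and the edge-case behavior follow the scope line above.

```python
import math
import random
import statistics


def bootstrap_ci(samples, n_boot=10_000, alpha=0.05, rng_seed=20260426):
    """Return (mean, lo, hi) for the mean of samples via a percentile bootstrap."""
    n = len(samples)
    if n == 0:
        # No data: every component is undefined.
        return (math.nan, math.nan, math.nan)
    if n == 1 or min(samples) == max(samples):
        # Single value or zero variance: the interval collapses onto the value.
        v = samples[0]
        return (v, v, v)
    mean = statistics.fmean(samples)
    rng = random.Random(rng_seed)  # fixed seed keeps re-runs identical
    boot_means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * (n_boot - 1))]
    hi = boot_means[int((1 - alpha / 2) * (n_boot - 1))]
    return (mean, lo, hi)


empty_result = bootstrap_ci(())
single_result = bootstrap_ci((0.42,))
flat_result = bootstrap_ci((0.5, 0.5, 0.5))
```

Returning early for the degenerate inputs also avoids spending 10,000 resamples on an answer that is known in advance.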
1.6 Per-language cohort rendering (test_per_language_cohort.py)
Scope (evaluation.md §3.4): cohorts with n_episodes >= 5 render numeric mean + 95% CI; 1 <= n <= 4 renders striped low-n bar with label (low-n); n == 0 renders empty slot labelled (no episodes). No CI computed for low-n or empty cohorts.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U18 | test_cohort_ge_5_renders_numeric_with_ci | Fixture cohort of 11 episodes. | Rendered bar payload has n_episodes == 11, numeric mean, ci is not None. |
| U19 | test_cohort_low_n_1_to_4_renders_striped | Fixture cohort of 3 episodes. | Rendered bar carries style == "striped", label.endswith("(low-n)"), ci is None. |
| U20 | test_cohort_empty_renders_empty_slot | Fixture cohort of 0 episodes. | Rendered entry has n_episodes == 0, label.endswith("(no episodes)"), math.isnan(mean). |
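The three rendering tiers can be sketched as a single branching function. CohortBar, render_cohort, the "solid" and "empty" style names, and the choice to still report a mean for low-n cohorts are assumptions made for illustration; the thresholds and label suffixes are the ones stated in the scope line.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class CohortBar:
    """Hypothetical payload for one per-language bar."""
    language: str
    n_episodes: int
    mean: float
    ci: Optional[Tuple[float, float]]
    style: str
    label: str


def render_cohort(
    language: str,
    rewards: Sequence[float],
    ci_fn: Callable[[Sequence[float]], Tuple[float, float]],
) -> CohortBar:
    n = len(rewards)
    if n == 0:
        # Empty slot: no mean and no CI.
        return CohortBar(language, 0, math.nan, None, "empty", f"{language} (no episodes)")
    mean = sum(rewards) / n
    if n <= 4:
        # Low-n: show the bar, but never compute a CI.
        return CohortBar(language, n, mean, None, "striped", f"{language} (low-n)")
    return CohortBar(language, n, mean, ci_fn(rewards), "solid", language)


def fake_ci(rewards):
    """Placeholder CI function so the sketch does not depend on bootstrap_ci."""
    return (min(rewards), max(rewards))


full_bar = render_cohort("hi", [0.1] * 11, fake_ci)
low_bar = render_cohort("ta", [0.2, 0.4, 0.6], fake_ci)
empty_bar = render_cohort("kn", [], fake_ci)
```

Checking the empty cohort with math.isnan matters because NaN never compares equal or identical to another NaN object.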
1.7 Probe scanner + novel class (test_probe_scanner_mechanics.py, test_probe_novel_class.py)
Scope (evaluation.md §3.6): all 5 exploit classes always enumerated in the report (count ≥ 0); any offense.code ∉ EXPLOIT_CLASSES surfaces as a Novel exploit (threshold = 1 occurrence).
| # | Name | Setup | Assertion |
|---|---|---|---|
| U21 | test_probe_emits_all_5_known_classes_even_when_zero | Fixture rewards with zero anti-hack offenses. Call probe_reward_hacking. | len(report.per_class) >= 5; every member of EXPLOIT_CLASSES present; each has count == 0, rate == 0.0, example_episode_id is None. |
| U22 | test_probe_novel_class_surfaced_on_single_instance | Fixture: one episode with offense.code == "zero_width_evasion". | report.novel_classes == ("zero_width_evasion",); summary row for it has count == 1; markdown contains substring "UNKNOWN EXPLOIT CLASS". |
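The aggregation behind U21 and U22 could be sketched with collections.Counter as below. Only hallucinated_field and bare_drift_claim are named in this plan, so the other three entries of KNOWN_CLASSES are explicit placeholders, and summarize_offenses with its dict-based rows is a hypothetical shape (the real code uses ProbeExploitClassSummary). The rate is assumed to be count divided by n_episodes, which matches the 1/200 = 0.005 figure used in §5.5.

```python
from collections import Counter

# Two real class names from this plan plus three labelled placeholders.
KNOWN_CLASSES = (
    "hallucinated_field",
    "bare_drift_claim",
    "placeholder_class_3",
    "placeholder_class_4",
    "placeholder_class_5",
)


def summarize_offenses(codes_by_episode: dict, known=KNOWN_CLASSES):
    """Return (rows, novel_classes) for a mapping episode_id -> list of offense codes."""
    counts = Counter()
    first_seen = {}
    for episode_id, codes in codes_by_episode.items():
        for code in codes:
            counts[code] += 1
            first_seen.setdefault(code, episode_id)
    n = len(codes_by_episode)
    # Anything outside the known set is novel after a single occurrence.
    novel = tuple(sorted(code for code in counts if code not in known))
    rows = [
        {
            "exploit_class": code,
            "count": counts[code],  # Counter returns 0 for unseen known classes
            "rate": counts[code] / n if n else 0.0,
            "example_episode_id": first_seen.get(code),
        }
        for code in (*known, *novel)
    ]
    return rows, novel


episodes = {f"ep_{i:03d}": [] for i in range(200)}
episodes["ep_131"] = ["zero_width_evasion"]
rows, novel = summarize_offenses(episodes)
```

Iterating over the known classes first is what guarantees a zero-count row for each of them even when nothing was found.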
1.8 Probe guards (test_probe_on_base_guard.py, test_probe_insufficient_samples.py)
Scope (evaluation.md §5): ProbeOnBaseModelError when model_path=="base"; ProbeInsufficientSamplesError when episodes < 50.
| # | Name | Setup | Assertion |
|---|---|---|---|
| U23 | test_probe_on_base_raises | probe_reward_hacking("base", 200). | Raises ProbeOnBaseModelError; no call to training.eval. |
| U24 | test_probe_insufficient_samples_raises | probe_reward_hacking(Path("/ckpt"), 49). | Raises ProbeInsufficientSamplesError with substring "n < 50". |
1.9 Catalogue hash + budget (test_catalogue_hash_mismatch.py, test_eval_budget_exceeded.py)
Scope: evaluation.md §3.1 (catalogue pinning) and §3.8 (wall-clock ceilings 20 min / 60 min / 2 min, raising EvalBudgetExceededError).
| # | Name | Setup | Assertion |
|---|---|---|---|
| U25 | test_catalogue_hash_mismatch_blocks_eval | Fixture BriefRow.catalogue_hash set to "stale"; currently-loaded yaml hashes to "current". Call run_eval. | Raises CatalogueHashMismatchError before any rollout; no training.eval call. |
| U26 | test_run_eval_budget_20min_exceeded_raises | Monkeypatch time.monotonic to simulate 20 min 1 s elapsed. | Raises EvalBudgetExceededError with substring "run_eval" and "20 min". |
| U27 | test_probe_budget_60min_and_plot_budget_2min | Parametrized across (probe_reward_hacking, 60*60+1) and (render_plots, 120+1). | Both raise EvalBudgetExceededError naming their respective ceiling. |
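The wall-clock ceilings in U26 and U27 are easiest to test when the clock is injectable. The sketch below assumes a small BudgetGuard helper (a hypothetical name, as is the message wording) that reads time.monotonic by default and accepts a fake clock in tests, which has the same effect as monkeypatching time.monotonic.

```python
import time


class EvalBudgetExceededError(RuntimeError):
    """Stand-in for the error class named in evaluation.md."""


class BudgetGuard:
    """Raise once more than ceiling_s seconds have elapsed since construction."""

    def __init__(self, name, ceiling_s, clock=time.monotonic):
        self.name = name
        self.ceiling_s = ceiling_s
        self._clock = clock
        self._start = clock()

    def check(self):
        elapsed = self._clock() - self._start
        if elapsed > self.ceiling_s:
            raise EvalBudgetExceededError(
                f"{self.name} exceeded its {self.ceiling_s / 60:g} min wall-clock budget "
                f"({elapsed:.0f} s elapsed)"
            )


def fake_clock(*readings):
    """Return a clock that yields the given readings in order."""
    it = iter(readings)
    return lambda: next(it)


def budget_message(name, ceiling_s, elapsed_s):
    """Return the error message, or None when the budget holds."""
    guard = BudgetGuard(name, ceiling_s, clock=fake_clock(0.0, elapsed_s))
    try:
        guard.check()
    except EvalBudgetExceededError as exc:
        return str(exc)
    return None


run_eval_msg = budget_message("run_eval", 20 * 60, 20 * 60 + 1)
plots_msg = budget_message("render_plots", 120, 119)
```

time.monotonic is the natural default because it cannot jump backwards when the system clock is adjusted.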
2. Property tests
Property tests use hypothesis.strategies to generate arbitrary-but-bounded inputs and assert the invariants evaluation.md commits to. Each test is seeded with the @hypothesis.seed(20260426) decorator so failures are reproducible.
Property inventory: 6 properties total (exceeds the ≥ 5 requirement).
2.1 Eval purity: same (model, 50-ep) produces byte-identical EvalReport
@given(checkpoint=st.sampled_from([Path("/ckpt/a"), Path("/ckpt/b")]))
@settings(max_examples=10, deadline=None)
def test_eval_is_pure(checkpoint, stubbed_training_eval):
    # Two back-to-back runs on the same checkpoint must serialize identically.
    r1 = run_eval(checkpoint, episodes=50)
    r2 = run_eval(checkpoint, episodes=50)
    assert serialize(r1) == serialize(r2)
Invariant: For the same checkpoint and the same 50-row slice, two back-to-back run_eval calls produce EvalReport records whose canonical JSON (sort_keys=True, separators=(",", ":")) byte-compares equal, including every r{i}_mean_ci tuple and every WandB-independent curves entry. This is the evaluation.md §1 "Deterministic on re-run" invariant, bound to the fixed rng_seed=20260426 in bootstrap_ci.
2.2 Bootstrap CI convergence: jitter ≤ 0.001 at n_boot=10_000
@given(samples=st.lists(st.floats(min_value=0.0, max_value=1.0, allow_nan=False),
                        min_size=50, max_size=50))
@settings(max_examples=20, deadline=None)
def test_bootstrap_ci_converges(samples):
    # Same samples, two different bootstrap seeds: the bounds should barely move.
    m1, lo1, hi1 = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426)
    m2, lo2, hi2 = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426 + 1)
    assert abs(lo1 - lo2) <= 0.001
    assert abs(hi1 - hi2) <= 0.001
Invariant: At n_boot=10_000 the 2.5th / 97.5th percentile estimates differ by at most 0.001 across distinct bootstrap seeds on the same underlying samples, which is the Monte-Carlo jitter ceiling referenced by evaluation.md §3.3. This property defines "convergent enough" and guards against anyone lowering n_boot silently.
2.3 Paired-diff CI equals final − baseline per episode
@given(b=st.lists(st.floats(min_value=0.0, max_value=1.0, allow_nan=False), min_size=50, max_size=50),
       f=st.lists(st.floats(min_value=0.0, max_value=1.0, allow_nan=False), min_size=50, max_size=50))
@settings(max_examples=30, deadline=None)
def test_paired_diff_ci_is_paired(b, f):
    pd_mean, _, _ = paired_difference_ci(tuple(b), tuple(f),
                                         n_boot=10_000, rng_seed=20260428)
    # Expected value: mean of the index-aligned differences.
    expected = sum(fi - bi for bi, fi in zip(b, f)) / 50
    assert math.isclose(pd_mean, expected, abs_tol=1e-9)
Invariant: The paired-difference mean is the arithmetic mean of the per-index differences final[i] − baseline[i], computed over index-aligned pairs rather than over two unlinked samples (evaluation.md §2.4 paired_difference_ci docstring + §7 edge case 7). For equal-length inputs this point estimate coincides algebraically with mean(final) − mean(baseline), so the property pins the mean only; the effect of pairing shows up in the CI bounds, which depend on how the differences are resampled.
2.4 Paired-diff CI requires equal lengths
@given(n_b=st.integers(min_value=1, max_value=100),
       n_f=st.integers(min_value=1, max_value=100))
def test_paired_diff_requires_equal_lengths(n_b, n_f):
    if n_b == n_f:
        return  # happy path covered elsewhere
    with pytest.raises(EpisodeSetLeakError):
        paired_difference_ci(tuple([0.1] * n_b), tuple([0.2] * n_f))
Invariant: Unequal lengths raise EpisodeSetLeakError; pairing is strictly index-aligned (evaluation.md §2.4).
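Properties 2.3 and 2.4 together suggest an implementation along the lines below. As with any sketch here, it is an assumption-laden stand-in: it uses random.Random instead of numpy.random.default_rng, a nearest-rank percentile, and a locally defined EpisodeSetLeakError, so only the pairing logic and the length guard are meant to carry over.

```python
import math
import random
import statistics


class EpisodeSetLeakError(ValueError):
    """Stand-in for the error class named in evaluation.md."""


def paired_difference_ci(baseline, final, n_boot=10_000, alpha=0.05, rng_seed=20260428):
    """Return (mean, lo, hi) of final[i] - baseline[i] over index-aligned pairs."""
    if len(baseline) != len(final):
        raise EpisodeSetLeakError(
            f"paired-comparison invariant violated: {len(baseline)} vs {len(final)} episodes"
        )
    diffs = [f - b for b, f in zip(baseline, final)]
    n = len(diffs)
    if n == 0:
        return (math.nan, math.nan, math.nan)
    mean = statistics.fmean(diffs)
    rng = random.Random(rng_seed)
    # Resample whole differences, so each episode keeps its own baseline/final pair.
    boot = sorted(statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = boot[int((alpha / 2) * (n_boot - 1))]
    hi = boot[int((1 - alpha / 2) * (n_boot - 1))]
    return (mean, lo, hi)


def raises_on_unequal(n_b, n_f):
    try:
        paired_difference_ci(tuple([0.1] * n_b), tuple([0.2] * n_f), n_boot=10)
    except EpisodeSetLeakError:
        return True
    return False


pd_mean, pd_lo, pd_hi = paired_difference_ci((0.1, 0.2, 0.3), (0.2, 0.4, 0.6), n_boot=500)
```

Resampling the differences rather than the two arrays independently is what keeps the interval from being inflated by between-episode variance that both runs share.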
2.5 Bootstrap CI bracketing: lo ≤ mean ≤ hi
@given(samples=st.lists(st.floats(min_value=-1.0, max_value=1.0, allow_nan=False),
                        min_size=2, max_size=200))
def test_bootstrap_ci_brackets_mean(samples):
    m, lo, hi = bootstrap_ci(tuple(samples), n_boot=10_000, rng_seed=20260426)
    assert lo <= m <= hi
Invariant: The percentile bootstrap CI always brackets the point estimate. Guards against transposed percentile extraction (e.g., accidentally returning 97.5 as lo).
2.6 Probe class closure: every emitted class is either known or appears in novel_classes
@given(offense_codes=st.lists(st.text(min_size=1, max_size=32), min_size=0, max_size=40))
def test_probe_class_closure(offense_codes, stubbed_training_eval_with_offenses):
    report = probe_reward_hacking(Path("/ckpt"), episodes=200)
    # Every summary row must be a known class or be listed as novel.
    for row in report.per_class:
        assert row.exploit_class in EXPLOIT_CLASSES or row.exploit_class in report.novel_classes
Invariant: Every ProbeExploitClassSummary.exploit_class value in report.per_class is either one of the 5 known classes or is listed verbatim in report.novel_classes. No summary row exists outside this closure, which protects the discovery channel.
3. Integration tests
Integration tests exercise cross-module wiring end-to-end. Real training.train.eval is stubbed (no CUDA), but the full evaluation pipeline (run_eval → EvalReport → render_plots, and probe_reward_hacking → scan → render_probe_report_md) runs on actual dataclass instances.
File: tests/test_evaluation/test_integration.py.
3.1 Baseline eval on base model (stubbed)
test_integration_baseline_on_base_model
    Setup:  stubbed_training_eval returns baseline_eval_report_fixture.
            eval_50_episodes_val_slice loaded from fixtures/val_briefs_50.jsonl.
    Action: run_eval("base", 50)
    Assert: EvalReport.model_path == "base"
            EvalReport.n_episodes == 50
            len(r1_mean_ci) == 3 and lo <= mean <= hi for every r{i}_mean_ci
            breakdown["episode_ids"] == tuple of 50 episode_ids from val[0:50]
            JSON round-trip through canonical serializer is byte-stable.
3.2 Final eval on trained LoRA (stubbed)
test_integration_final_on_trained_lora
    Setup:  stubbed_training_eval returns final_eval_report_fixture.
            baseline.json already present on disk (from §3.1).
    Action: run_eval(Path("/fake/ckpt/stage3_final"), 50)
            then post-run guard: assert baseline.episode_ids == final.episode_ids
    Assert: EvalReport.model_path == "/fake/ckpt/stage3_final"
            EvalReport.reward_mean_ci[0] > baseline.reward_mean_ci[0]
            paired_difference_ci stored under breakdown["paired_ci"]
            No EpisodeSetLeakError raised.
3.3 Probe 200 episodes → markdown report
test_integration_probe_200_episodes_produces_markdown
    Setup:  stubbed_training_eval returns 200 Rewards records with:
            - 2 hallucinated_field offenses
            - 1 bare_drift_claim offense
            - 0 of the other 3 classes
    Action: report = probe_reward_hacking(Path("/fake/ckpt"), 200)
            md_path = render_probe_report_md(report, tmp_path / "probe_report.md")
    Assert: md_path.exists()
            md_text.count("### ") == 5  # all 5 known classes present
            "Novel exploit classes: none" in md_text
            "**Total offenses:** 3" in md_text
            json_round_trip(report) bytes-stable
            probe_report.json passes schema validation.
3.4 Plot rendering for all 4 target curves
test_integration_render_all_4_plots
    Setup:  baseline_eval_report_fixture, final_eval_report_fixture,
            stubbed WandB run-history returning R{1..5}_mean per step and
            eval/drift_latency_p50+p95 at steps {50,100,150,200,300,400,500}.
    Action: paths = render_plots(baseline, final, wandb_run_id="stub-run",
                                 out_dir=tmp_path)
    Assert: set(paths.keys()) == {
                "per_reward_stack", "drift_latency_vs_step",
                "per_language_bars", "before_after_bars"
            }
            every path .exists() and .stat().st_size > 1024 bytes
            PIL.Image.open(path).size == (1600, 900)  # canonical figsize
            pytest-mpl snapshot compare passes with rms < 0.5 per plot.
3.5 WandB unavailable → graceful degrade (2 plots)
test_integration_render_plots_without_wandb
    Setup:  render_plots(..., wandb_run_id=None)
    Assert: set(paths.keys()) == {"per_language_bars", "before_after_bars"}
            WandBHistoryUnavailableWarning emitted (captured via pytest.warns)
            no PlotRenderError raised
            returned dict omits the two history-driven plots.
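The degrade path amounts to warning once and returning only the plots that can be drawn from the two EvalReports. A minimal sketch follows, assuming that per_reward_stack and drift_latency_vs_step are the history-driven pair (that is what the §3.4 setup implies) and using a hypothetical select_plots helper in place of the real render_plots.

```python
import warnings


class WandBHistoryUnavailableWarning(UserWarning):
    """Stand-in for the warning class named above."""


# Assumption: these two need the WandB run history, the other two do not.
HISTORY_PLOTS = ("per_reward_stack", "drift_latency_vs_step")
REPORT_PLOTS = ("per_language_bars", "before_after_bars")


def select_plots(wandb_run_id):
    """Return the plot names that can be rendered for this run."""
    if wandb_run_id is None:
        warnings.warn(
            "WandB history unavailable; skipping history-driven plots",
            WandBHistoryUnavailableWarning,
            stacklevel=2,
        )
        return REPORT_PLOTS
    return HISTORY_PLOTS + REPORT_PLOTS


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    degraded = select_plots(None)

full = select_plots("stub-run")
```

In a pytest test the catch_warnings block would be replaced by pytest.warns(WandBHistoryUnavailableWarning).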
3.6 GPU delegation path (skipped on CPU-only CI)
@pytest.mark.cuda
test_integration_real_training_eval_delegation
    Setup:  real Gemma 3n E2B + toy LoRA adapter on a 2-episode smoke slice.
    Assert: run_eval returns a valid EvalReport; no exception.
            Skipped automatically when torch.cuda.is_available() is False.
4. Coverage target
Line coverage: 100% on each of:
- training/eval_baseline.py
- training/eval_final.py
- training/probe.py (alias for training/probe_reward_hacking.py)
- training/plots.py
Branch coverage: ≥ 95% on the same files. Exclusions (via # pragma: no cover with justification comment):
- if TYPE_CHECKING: import blocks.
- Real-CUDA-only branches inside training.eval delegation that the stubbed_training_eval fixture bypasses (marked with # pragma: no cover-stubbed).
- matplotlib backend-selection dead code paths (backend == "Agg" already forced at module import).
Verification command:
pytest tests/test_evaluation/ \
    --cov=training.eval_baseline \
    --cov=training.eval_final \
    --cov=training.probe \
    --cov=training.plots \
    --cov-branch \
    --cov-fail-under=100 \
    --cov-report=term-missing
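Branch coverage is independently enforced against a local threshold of 95 kept alongside .coveragerc as fail_under_branch = 95. Two caveats apply. First, fail_under_branch is not a built-in coverage.py option (coverage.py reads a single fail_under value), so a project-side check has to read and apply it. Second, with --cov-branch enabled, --cov-fail-under=100 is evaluated against the combined line-and-branch total, so the 100% line target and the 95% branch floor need to be checked separately.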
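A separate branch-coverage gate could be sketched as follows. It assumes a JSON coverage report (for example from coverage json or pytest-cov's --cov-report=json) in which each file's summary carries num_branches and covered_branches; the helper names and the sample numbers are illustrative only.

```python
import json


def branch_percent(summary: dict) -> float:
    """Branch coverage for one summary block; files without branches count as 100%."""
    num = summary.get("num_branches", 0)
    if num == 0:
        return 100.0
    return 100.0 * summary.get("covered_branches", 0) / num


def files_below_branch_floor(report: dict, floor: float = 95.0) -> list:
    """Return (path, percent) pairs for every file under the branch floor."""
    failing = []
    for path, entry in sorted(report.get("files", {}).items()):
        pct = branch_percent(entry["summary"])
        if pct < floor:
            failing.append((path, round(pct, 2)))
    return failing


# Made-up numbers in the assumed report shape.
sample_report = json.loads("""
{
  "files": {
    "training/eval_final.py": {"summary": {"num_branches": 40, "covered_branches": 38}},
    "training/plots.py": {"summary": {"num_branches": 20, "covered_branches": 18}},
    "training/probe.py": {"summary": {"num_branches": 0, "covered_branches": 0}}
  }
}
""")

failing_files = files_below_branch_floor(sample_report)
```

A CI step would load the real report with json.load and exit non-zero when the returned list is not empty.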
Per-file test → line mapping (authoritative):
| File | Covering test file(s) | Targeted lines |
|---|---|---|
| training/eval_baseline.py | test_run_eval_signature.py, test_episode_selection_deterministic.py, test_sampling_policy_frozen.py, test_zero_success_baseline.py, test_eval_budget_exceeded.py | CLI argparse, run_eval("base", …) call, baseline.json write, ZeroSuccessBaselineWarning path. |
| training/eval_final.py | test_run_eval_signature.py, test_episode_set_leak_error.py, test_aggregation_bootstrap_ci.py, test_plot_rendering.py, test_plot_graceful_degrade.py | CLI, run_eval(ckpt, …), paired-diff CI store, render_plots call, EpisodeSetLeakError guard at exit. |
| training/probe.py | test_probe_scanner_mechanics.py, test_probe_novel_class.py, test_probe_on_base_guard.py, test_probe_insufficient_samples.py, test_render_probe_report_md.py, test_probe_report_json_roundtrip.py, test_exploit_classes_always_emitted.py | probe_reward_hacking, scan_episode_for_exploits, render_probe_report_md, JSON serializer, all 5 guard branches. |
| training/plots.py | test_plot_rendering.py, test_plot_graceful_degrade.py | All 4 plot functions, WandB history fetcher, graceful-degrade branching, PlotRenderError path. |
5. Fixtures
All fixtures live in tests/conftest.py (repo-root shared scope) so that every shared name resolves to identical content in this plan and in each consuming plan (see §5.6 for which plans consume which fixture). Fixture IDs are the function names registered via @pytest.fixture.
5.1 eval_50_episodes_val_slice
Shape: tuple[BriefRow, ...] of length 50. Sourced from tests/fixtures/val_briefs_50.jsonl, which holds the first 50 rows of a publication seed of val/briefs.jsonl (datasets.md §4.7), committed verbatim (≤ 120 KiB). Each row carries episode_id, seed, catalogue_hash, templates_sha256, i18n_sha256, goal.language ∈ {hi, ta, kn, en, hinglish}, and the embedded GoalSpec.
Shared with: training_tests.md only (imported under the same canonical name eval_50_episodes_val_slice for EpisodeDatasetAdapter iteration in integration test 3.4; see the training_tests.md §5 footer cross-reference). Per the authoritative table in §5.6, pitch_demo_tests.md and risk_book_tests.md do not consume it. This plan is the sole definition site; all consumers import from tests/conftest.py.
5.2 baseline_eval_report_fixture
Shape: A hand-built EvalReport mirroring the numbers in evaluation.md §8.1 verbatim:
- model_path = "base", n_episodes = 50
- reward_mean_ci = (0.118, 0.086, 0.152), r1_mean_ci = (0.100, 0.040, 0.180), etc.
- 5 PerLanguageReports (hi n=11, ta n=10, kn n=9, en n=10, hinglish n=10)
- drift_detection_latency with stage2/stage3 all NaN, undetected_count=27
- breakdown["episode_ids"] = tuple_of_50_ids (deterministic)
- reward_hacking_offenses = {"hallucinated_field": 7, ...} per §8.1
Used by: integration §3.1, §3.4, §3.5. Eval-only; not shared with training_tests.md (which constructs EvalReports inline via stubbed training.eval).
5.3 final_eval_report_fixture
Shape: Hand-built EvalReport matching evaluation.md §8.2:
- model_path = "/abs/path/checkpoints/stage3_final", n_episodes = 50
- reward_mean_ci = (0.542, 0.480, 0.604), etc.
- drift_detection_latency with stage2_mean=1.2, stage3_mean=1.6, undetected_count=9
- curves dict with the 4 keys enumerated in §8.2
- breakdown["paired_ci"] populated with ΔR1, ΔR2, Δreward_mean, Δdrift_latency triples
- Shares episode_ids tuple with §5.2 (paired-comparison invariant)
Used by: integration §3.2, §3.4, §3.5. Eval-only; not shared with training_tests.md.
5.4 probe_report_no_exploits
Shape: ProbeReport with n_episodes=200, per_class containing all 5 known classes at count=0, rate=0.0, example_episode_id=None, total_hits=0, novel_classes=(). Generated from a stubbed training.eval that returns 200 Rewards records with empty anti_hack.offenses lists.
Used by: integration §3.3 (happy-path markdown rendering), pitch_demo_tests.md (probe-artefact badge I12). Not consumed by risk_book_tests.md (risk-book's domain is Risk.triage / risk register, not exploit reports).
5.5 probe_report_with_novel_class
Shape: ProbeReport with n_episodes=200, per_class containing all 5 known classes (zeros) plus a sixth ProbeExploitClassSummary for exploit_class="zero_width_evasion" with count=1, rate=0.005, example_episode_id="s3_ep_00000131". novel_classes=("zero_width_evasion",). Generated from a stubbed training.eval that seeds a single offense.code="zero_width_evasion" into one episode's Rewards.breakdown.anti_hack.offenses.
Used by: novel-class unit test (§1.7 U22). Eval-only; not consumed by other plans.
5.6 Cross-plan sharing contract
This plan (evaluation_tests.md) is the sole definition site for all 5 fixtures below. Consumers import from tests/conftest.py; they MUST NOT redefine. ✅ = consumed (imported from conftest and actually referenced in the consuming plan). ❌ = not consumed by that plan.
| Fixture | Defined in | evaluation_tests.md | training_tests.md | pitch_demo_tests.md | risk_book_tests.md |
|---|---|---|---|---|---|
| eval_50_episodes_val_slice | evaluation_tests.md §5.1 | ✅ consumed (definer) | ✅ consumed (integration §3.4 paired eval; see training_tests.md §5.6 footer cross-reference) | ❌ (no eval slice needed by pitch demo tests) | ❌ (risk-register domain, no eval slice needed) |
| baseline_eval_report_fixture | evaluation_tests.md §5.2 | ✅ consumed (definer) | ❌ (eval-only; training stubs training.eval directly) | ❌ | ❌ |
| final_eval_report_fixture | evaluation_tests.md §5.3 | ✅ consumed (definer) | ❌ (eval-only) | ❌ | ❌ |
| probe_report_no_exploits | evaluation_tests.md §5.4 | ✅ consumed (definer) | ❌ (probe entry-point tested via direct mocks in U45) | ✅ consumed (I12 blog badge) | ❌ (risk register is about Risk.triage, not exploit reports) |
| probe_report_with_novel_class | evaluation_tests.md §5.5 | ✅ consumed (definer) | ❌ | ❌ | ❌ |
Truth-verification rule: every ✅ consumed above is bidirectionally consistent: the consuming plan's own §5 (a) does not re-define the fixture and (b) explicitly cross-references this section as the definition site. The only cross-plan consumer rows are training_tests.md for eval_50_episodes_val_slice (see training_tests.md §5.6) and pitch_demo_tests.md for probe_report_no_exploits (see pitch_demo_tests.md §5.5). All other cross-plan cells are ❌.
If a downstream plan needs to mutate any of these fixtures, it must define a derived fixture (e.g., @pytest.fixture def probe_report_no_exploits_for_demo(probe_report_no_exploits): ...) rather than editing the shared body. Enforced by a tests/conftest_lock.py sha256 check over the 5 fixture sources.
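The lock check could work along these lines. The layout of tests/conftest_lock.py is not described in this plan, so fixture_lock_digest and the idea of feeding it a name-to-source mapping (for instance collected with inspect.getsource in the real check) are assumptions; the sketch only shows how a single order-independent sha256 over several sources can be built.

```python
import hashlib


def fixture_lock_digest(sources: dict) -> str:
    """Return one sha256 hex digest over a mapping of fixture name -> source text.

    Names are processed in sorted order so the digest does not depend on dict
    ordering; NUL separators keep name/body boundaries unambiguous.
    """
    digest = hashlib.sha256()
    for name in sorted(sources):
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(sources[name].encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# Toy sources standing in for the real fixture bodies.
locked = {
    "probe_report_no_exploits": "def probe_report_no_exploits():\n    return 'v1'\n",
    "eval_50_episodes_val_slice": "def eval_50_episodes_val_slice():\n    return ()\n",
}
reordered = dict(reversed(list(locked.items())))
edited = dict(locked, probe_report_no_exploits="def probe_report_no_exploits():\n    return 'v2'\n")

baseline_digest = fixture_lock_digest(locked)
```

A derived fixture defined in a consuming plan leaves the shared bodies untouched, so the digest stays the same and the lock passes.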
End of evaluation_tests.md. This plan is sealed pending ≥ 2 fresh critic NOTHING_FURTHER returns per DRIFTCALL/CLAUDE.md §3.4.