Cell 18 — Baseline Evaluation
eval_baseline(...) runs the untrained Gemma 3n E2B on the first 50 rows of
val/briefs.jsonl under frozen-greedy sampling and returns an EvalReport
with bootstrap CIs (n_boot=10_000, rng_seed=20260426).
Contract: evaluation.md §2.1, §3.1–§3.3, §3.8, §4, §5.
- 50 held-out val episodes, file-order (no shuffle).
env.reset(seed=hash((episode_id, "eval")) & 0xFFFFFFFF).- Greedy:
temperature=0.0,num_generations=1,model.eval()+torch.no_grad(). - Wall-clock ceiling 20 min; raises
EvalBudgetExceededErroron overrun. - No LLM-as-judge (forbidden imports listed in
_NO_LLM_JUDGE_FORBIDDEN_IMPORTS).
The training-eval delegate is injected so unit tests stub model inference
on CPU-only CI (training_tests.md §5.3 mock_cuda pattern).