# Cell 18 — Baseline Evaluation

`eval_baseline(...)` runs the **untrained Gemma 3n E2B** on the first 50 rows of
`val/briefs.jsonl` under frozen-greedy decoding and returns an `EvalReport`
with bootstrap CIs (`n_boot=10_000`, `rng_seed=20260426`).

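To illustrate how a seeded bootstrap makes the reported intervals reproducible, here is a minimal sketch of a percentile bootstrap over per-episode scores. The function name `bootstrap_ci`, the `alpha` parameter, the percentile method, and the use of the standard-library `random` module are all assumptions; the source only fixes `n_boot` and `rng_seed`, and the real `EvalReport` may compute its intervals differently.

```python
import random
from statistics import fmean


def bootstrap_ci(scores, n_boot=10_000, rng_seed=20260426, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-episode scores (illustrative)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(rng_seed)  # fixed seed: same inputs give the same interval
    n = len(scores)
    # Resample n scores with replacement n_boot times and record each mean.
    means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return fmean(scores), lo, hi
```

Because the generator is local to the call and seeded explicitly, two calls on the same scores return identical bounds, which is what lets a baseline report be compared against later runs.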
**Contract:** evaluation.md §2.1, §3.1–§3.3, §3.8, §4, §5.

- 50 held-out val episodes, in file order (no shuffle).
- Seeding: `env.reset(seed=hash((episode_id, "eval")) & 0xFFFFFFFF)`.
- Greedy decoding: `temperature=0.0`, `num_generations=1`, under `model.eval()` and `torch.no_grad()`.
- Wall-clock ceiling of 20 min; raises `EvalBudgetExceededError` on overrun.
- No LLM-as-judge (forbidden imports are listed in `_NO_LLM_JUDGE_FORBIDDEN_IMPORTS`).
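The seeding and wall-clock items above can be sketched as a small driver loop. This is a sketch under stated assumptions, not the project's implementation: `run_with_budget`, `run_episode`, and the injectable `clock` are hypothetical names, and `EvalBudgetExceededError` is redefined locally as a stand-in for the project's exception. Only the seed formula and the 20-minute ceiling come from the contract.

```python
import time


class EvalBudgetExceededError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


def eval_seed(episode_id):
    # Contract formula: mask the tuple hash down to an unsigned 32-bit seed.
    return hash((episode_id, "eval")) & 0xFFFFFFFF


def run_with_budget(episode_ids, run_episode, budget_s=20 * 60, clock=time.monotonic):
    """Run episodes in the given order; raise once the wall-clock ceiling is passed."""
    start = clock()
    results = []
    for episode_id in episode_ids:  # file order is preserved, no shuffle
        if clock() - start > budget_s:
            raise EvalBudgetExceededError(
                f"baseline eval exceeded {budget_s}s after {len(results)} episodes"
            )
        results.append(run_episode(episode_id, eval_seed(episode_id)))
    return results
```

One point worth checking against evaluation.md: Python salts `str` hashes per process unless `PYTHONHASHSEED` is set, so `hash((episode_id, "eval"))` is stable within a run but is repeatable across runs only if that variable is pinned.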

The training-eval delegate is **injected** so that unit tests can stub model inference
on CPU-only CI (training_tests.md §5.3, `mock_cuda` pattern).
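As a rough illustration of the injection idea, the sketch below passes the generation function in as a plain callable, so a test can swap in a deterministic stub that never touches a GPU. The names `GenerateFn`, `BaselineEvaluator`, and `stub_generate` are hypothetical, and the delegate signature (prompts in, one completion per prompt out) is an assumption; the actual delegate interface and the `mock_cuda` fixture are defined in the project, not here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical delegate signature: a batch of prompts in, one completion per prompt out.
GenerateFn = Callable[[Sequence[str]], list[str]]


@dataclass
class BaselineEvaluator:
    generate: GenerateFn  # injected; the real delegate would wrap model inference

    def completions(self, prompts: Sequence[str]) -> list[str]:
        outputs = self.generate(prompts)
        if len(outputs) != len(prompts):
            raise ValueError("delegate must return one completion per prompt")
        return outputs


def stub_generate(prompts: Sequence[str]) -> list[str]:
    # Deterministic CPU-only stand-in used in place of model inference.
    return [f"stub:{p}" for p in prompts]
```

A unit test would then construct `BaselineEvaluator(generate=stub_generate)` and exercise the scoring path without loading any model weights.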