# Cell 18 — Baseline Evaluation
`eval_baseline(...)` runs the **untrained Gemma 3n E2B** on the first 50 rows of
`val/briefs.jsonl` under frozen-greedy sampling and returns an `EvalReport`
with bootstrap CIs (`n_boot=10_000`, `rng_seed=20260426`).
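As a rough illustration of what a seeded bootstrap CI with these parameters could look like, here is a minimal sketch. The percentile method, the `alpha` level, the `bootstrap_ci` helper, and the reward values are all assumptions for illustration; the actual `EvalReport` may use a different estimator or statistic.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=10_000, rng_seed=20260426, alpha=0.05):
    """Percentile bootstrap CI for the mean (the method is an assumption)."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(rng_seed)  # fixed seed: repeated runs give identical CIs
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return statistics.fmean(values), means[lo_idx], means[hi_idx]


# Hypothetical per-episode rewards for 50 episodes (made-up numbers).
rewards = [(i % 5) / 4 for i in range(50)]
mean, lo, hi = bootstrap_ci(rewards)
print(f"mean={mean:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

Seeding a private `random.Random` instance, rather than the module-level generator, keeps the interval reproducible regardless of other random calls in the process.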
**Contract:** evaluation.md §2.1, §3.1–§3.3, §3.8, §4, §5.
- 50 held-out val episodes, taken in file order (no shuffle).
- `env.reset(seed=hash((episode_id, "eval")) & 0xFFFFFFFF)`.
- Greedy: `temperature=0.0`, `num_generations=1`, `model.eval()` + `torch.no_grad()`.
- Wall-clock ceiling 20 min; raises `EvalBudgetExceededError` on overrun.
- No LLM-as-judge (forbidden imports listed in `_NO_LLM_JUDGE_FORBIDDEN_IMPORTS`).
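The row selection, seed derivation, and wall-clock ceiling above can be sketched as follows. This is not the project's implementation: `first_rows`, `reset_seed`, `run_with_budget`, the `episode_id` row field, and the local `EvalBudgetExceededError` class are stand-ins. One caveat worth checking: if `hash` is Python's built-in, string hashing is salted per process unless `PYTHONHASHSEED` is fixed, so the derived seed is only stable across runs under a fixed hash seed.

```python
import itertools
import json
import time


class EvalBudgetExceededError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


def first_rows(path, n=50):
    """Read the first n non-blank JSONL rows in file order (no shuffle)."""
    with open(path, encoding="utf-8") as fh:
        lines = (line for line in fh if line.strip())
        return [json.loads(line) for line in itertools.islice(lines, n)]


def reset_seed(episode_id):
    """32-bit seed as written in the contract bullet."""
    return hash((episode_id, "eval")) & 0xFFFFFFFF


def run_with_budget(episodes, run_episode, budget_s=20 * 60, clock=time.monotonic):
    """Run episodes in order; raise once the wall-clock budget is exceeded."""
    start = clock()
    results = []
    for ep in episodes:
        elapsed = clock() - start
        if elapsed > budget_s:
            raise EvalBudgetExceededError(
                f"eval exceeded {budget_s}s after {len(results)} episodes"
            )
        # "episode_id" as the row key is an assumption about the JSONL schema.
        results.append(run_episode(ep, reset_seed(ep["episode_id"])))
    return results
```

The clock is passed in as a parameter so that a test can simulate an overrun without waiting 20 minutes.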
The training-eval delegate is **injected**, so unit tests can stub model inference
and run on CPU-only CI (training_tests.md §5.3 `mock_cuda` pattern).
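A minimal sketch of the injection pattern, assuming the delegate is a plain callable that maps one row to a result. `EpisodeResult`, `EvalDelegate`, `eval_with_delegate`, and `stub_delegate` are hypothetical names for illustration, not the real `eval_baseline` signature.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EpisodeResult:
    """Hypothetical per-episode result record."""
    episode_id: str
    reward: float


# Hypothetical delegate type: takes one brief row, returns its result.
EvalDelegate = Callable[[dict], EpisodeResult]


def eval_with_delegate(
    rows: Sequence[dict], delegate: EvalDelegate
) -> list[EpisodeResult]:
    """Evaluate rows in order, deferring all model work to the delegate."""
    return [delegate(row) for row in rows]


def stub_delegate(row: dict) -> EpisodeResult:
    """CPU-only stand-in: no model load, constant reward."""
    return EpisodeResult(episode_id=row["episode_id"], reward=0.0)


# In a unit test the stub replaces the GPU-backed delegate.
results = eval_with_delegate([{"episode_id": "ep-0"}], stub_delegate)
```

Because the evaluation loop never imports or constructs the model itself, a test only needs to pass a different callable; nothing has to be monkeypatched.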