	# Cell 18 — Baseline Evaluation

	`eval_baseline(...)` runs the untrained Gemma 3n E2B on the first 50 rows of
	`val/briefs.jsonl` under frozen-greedy sampling and returns an `EvalReport`
	with bootstrap CIs (`n_boot=10_000`, `rng_seed=20260426`).
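The interval computation can be illustrated with a minimal sketch. The source does not say which bootstrap variant or statistic `EvalReport` uses, so the percentile bootstrap over the mean below is an assumption, as are the helper name `bootstrap_ci` and the `alpha=0.05` default; only `n_boot` and `rng_seed` come from the text above.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=10_000, rng_seed=20260426, alpha=0.05):
    """Percentile bootstrap CI for the mean (hypothetical helper).

    Assumption: the real EvalReport may use a different statistic or
    bootstrap variant; this only illustrates the general technique.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(rng_seed)  # dedicated RNG keeps the interval reproducible
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return statistics.fmean(values), lo, hi


# Usage with made-up per-episode scores (not real results).
scores = [1.0 if i % 3 else 0.0 for i in range(50)]
mean, lo, hi = bootstrap_ci(scores)
```

Seeding a private `random.Random` instance rather than the global generator means two calls with the same inputs return the same interval, which is the point of fixing `rng_seed`.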

	Contract: evaluation.md §2.1, §3.1–§3.3, §3.8, §4, §5.

	- 50 held-out val episodes, file-order (no shuffle).
	- `env.reset(seed=hash((episode_id, "eval")) & 0xFFFFFFFF)`.
	- Greedy: `temperature=0.0`, `num_generations=1`, `model.eval()` + `torch.no_grad()`.
	- Wall-clock ceiling 20 min; raises `EvalBudgetExceededError` on overrun.
	- No LLM-as-judge (forbidden imports listed in `_NO_LLM_JUDGE_FORBIDDEN_IMPORTS`).
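The per-episode seeding and the wall-clock ceiling listed above might be sketched as follows. One caveat: Python's built-in `hash()` of a tuple containing a `str` changes between interpreter processes unless `PYTHONHASHSEED` is fixed, so this sketch swaps in a `hashlib`-based helper instead of the formula in the list. That substitution, the names `episode_seed` and `run_with_budget`, and the local `EvalBudgetExceededError` stand-in are all assumptions, not the project's code.

```python
import hashlib
import time


class EvalBudgetExceededError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


def episode_seed(episode_id: str, salt: str = "eval") -> int:
    """Process-independent 32-bit seed (hypothetical helper).

    Assumption: this replaces hash((episode_id, "eval")) & 0xFFFFFFFF
    with a SHA-256 digest so the value does not depend on PYTHONHASHSEED.
    """
    digest = hashlib.sha256(f"{episode_id}:{salt}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") & 0xFFFFFFFF


def run_with_budget(episode_ids, run_episode, budget_s=20 * 60,
                    clock=time.monotonic):
    """Run episodes in the given order, enforcing a wall-clock ceiling.

    The budget is checked before each episode, so a single long episode
    is not interrupted midway; the overrun surfaces at the next check.
    """
    start = clock()
    results = []
    for episode_id in episode_ids:
        elapsed = clock() - start
        if elapsed > budget_s:
            raise EvalBudgetExceededError(
                f"evaluation exceeded {budget_s}s after {len(results)} episodes"
            )
        results.append(run_episode(episode_id, episode_seed(episode_id)))
    return results
```

Passing `clock` as a parameter lets a test drive the budget check with a fake clock instead of waiting 20 minutes.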

	The training-eval delegate is injected so that unit tests can stub out model
	inference on CPU-only CI (training_tests.md §5.3 `mock_cuda` pattern).
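The injection idea can be shown with a small sketch. The real signature of `eval_baseline(...)` and of its delegate is not given here, so `eval_with_delegate`, `GenerateFn`, `MiniReport`, and `fake_generate` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical delegate type: maps one brief to one completion.
GenerateFn = Callable[[str], str]


@dataclass(frozen=True)
class MiniReport:
    """Tiny stand-in for EvalReport (assumption, not the real class)."""
    n_episodes: int
    completions: Tuple[str, ...]


def eval_with_delegate(briefs: Sequence[str], generate: GenerateFn,
                       limit: int = 50) -> MiniReport:
    """Evaluate the first `limit` briefs in file order via the injected delegate."""
    selected = list(briefs)[:limit]  # no shuffle: keep input order
    return MiniReport(len(selected), tuple(generate(b) for b in selected))


def fake_generate(brief: str) -> str:
    """CPU-only stub a unit test could inject in place of model inference."""
    return f"stub:{brief}"


# Usage: 60 made-up briefs in, first 50 evaluated with the stub.
report = eval_with_delegate([f"b{i}" for i in range(60)], fake_generate)
```

Because the evaluation function only sees a callable, a test can exercise the row selection and report assembly without loading a model or touching a GPU.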