Cell 18 — Baseline Evaluation

eval_baseline(...) runs the untrained Gemma 3n E2B on the first 50 rows of val/briefs.jsonl under frozen-greedy sampling and returns an EvalReport with bootstrap CIs (n_boot=10_000, rng_seed=20260426).
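As a rough illustration of the bootstrap step, the sketch below computes a percentile bootstrap confidence interval for the mean of per-episode scores, using the n_boot and rng_seed values quoted above. The percentile method, the choice of the mean as the statistic, the function name bootstrap_ci, and the sample rewards are all assumptions made for illustration; evaluation.md defines the actual procedure.

```python
import random


def bootstrap_ci(values, n_boot=10_000, rng_seed=20260426, alpha=0.05):
    """Percentile bootstrap CI for the mean of `values` (illustrative sketch)."""
    if not values:
        raise ValueError("values must be non-empty")
    # A local RNG keeps the result reproducible without touching global state.
    rng = random.Random(rng_seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(int((1 - alpha / 2) * n_boot), n_boot - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical per-episode rewards for 50 episodes (made-up values).
rewards = [0.0, 1.0] * 25
low, high = bootstrap_ci(rewards)
print(f"mean=0.5, 95% CI=({low:.3f}, {high:.3f})")
```

Fixing the seed means that two runs over the same scores report the same interval, which is presumably why the contract pins rng_seed.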

Contract: evaluation.md §2.1, §3.1–§3.3, §3.8, §4, §5.

50 held-out validation episodes, taken in file order (no shuffle).
Each episode is seeded with env.reset(seed=hash((episode_id, "eval")) & 0xFFFFFFFF).
Greedy decoding: temperature=0.0, num_generations=1, model.eval() + torch.no_grad().
Wall-clock ceiling of 20 min; raises EvalBudgetExceededError on overrun.
No LLM-as-judge (forbidden imports are listed in _NO_LLM_JUDGE_FORBIDDEN_IMPORTS).
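The row selection, seeding, and time budget in the list above could be wired together roughly as follows. This is a minimal sketch, not the project's code: the helper names, the "episode_id" field name, the injected run_episode callable, and the choice to check the budget before each episode are assumptions, and EvalBudgetExceededError is redefined here only as a stand-in for the project's own class.

```python
import json
import time
from itertools import islice


class EvalBudgetExceededError(RuntimeError):
    """Stand-in for the project's error class of the same name."""


def load_first_rows(path, limit=50):
    """Read the first `limit` JSONL rows in file order, with no shuffling."""
    with open(path, encoding="utf-8") as fh:
        non_blank = (line for line in fh if line.strip())
        return [json.loads(line) for line in islice(non_blank, limit)]


def episode_seed(episode_id):
    """The seed expression from the list above, masked to 32 bits."""
    return hash((episode_id, "eval")) & 0xFFFFFFFF


def run_with_budget(rows, run_episode, budget_s=20 * 60, clock=time.monotonic):
    """Run episodes in order; raise once the wall-clock budget is exceeded."""
    start = clock()
    results = []
    for row in rows:
        # Assumption: the budget is checked before each episode starts.
        if clock() - start > budget_s:
            raise EvalBudgetExceededError(f"exceeded {budget_s} s budget")
        # "episode_id" is a hypothetical field name for the row's identifier.
        results.append(run_episode(row, episode_seed(row["episode_id"])))
    return results
```

One caveat on the seed expression: Python salts string hashes per process by default, so hash((episode_id, "eval")) yields the same value across separate runs only when PYTHONHASHSEED is fixed in the environment. Within a single process the value is stable.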

The training-eval delegate is injected so that unit tests can stub model inference on CPU-only CI (training_tests.md §5.3, mock_cuda pattern).
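The injection idea can be shown with a small sketch: the evaluation routine receives its inference function as an argument, so a test can pass a stub that never touches a GPU. The names GenerateFn, eval_with_delegate, and stub_generate, and the "prompt" field, are hypothetical; the real delegate interface is defined by the project.

```python
from typing import Callable, Dict, List

# Hypothetical delegate type: maps a prompt string to generated text.
GenerateFn = Callable[[str], str]


def eval_with_delegate(briefs: List[Dict[str, str]], generate: GenerateFn) -> List[str]:
    """Run every brief through the injected delegate; no model is imported here."""
    return [generate(brief["prompt"]) for brief in briefs]


def stub_generate(prompt: str) -> str:
    """CPU-only stand-in that a unit test passes instead of real inference."""
    return f"stub:{prompt}"


outputs = eval_with_delegate([{"prompt": "hello"}], stub_generate)
print(outputs)  # ['stub:hello']
```

Because the routine depends only on the callable's signature, the same code path runs in CI with the stub and in production with the real model wrapper.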