Cell 19 — Final Evaluation (Post-Training LoRA)

eval_final(checkpoint, ..., baseline=baseline_report) runs the trained LoRA on the same 50 paired episodes used by the baseline (evaluation.md §3.1) and stores the paired-difference 95% CIs under EvalReport.breakdown['paired_ci'].
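As a rough illustration of what a paired-difference 95% CI involves, the sketch below resamples per-episode score differences with a percentile bootstrap. It is not the project's implementation: the function name `paired_bootstrap_ci`, the `n_resamples` parameter, and the choice of the percentile method are assumptions made for this example. Only the seed value is taken from the contract in this cell.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(baseline_scores, final_scores,
                        seed=20260428, n_resamples=2000, level=0.95):
    """Percentile-bootstrap CI for the mean paired difference (final - baseline).

    Hypothetical helper: the name, n_resamples, and the percentile method are
    assumptions for illustration; only the seed comes from the cell's contract.
    """
    if len(baseline_scores) != len(final_scores) or not baseline_scores:
        raise ValueError("scores must be non-empty and paired one-to-one")

    # Pairing: episode i of the baseline is compared with episode i of the final run.
    diffs = [f - b for b, f in zip(baseline_scores, final_scores)]

    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(diffs)

    # Resample the differences with replacement and record each resample's mean.
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))

    alpha = 1.0 - level
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return fmean(diffs), lo, hi
```

Calling `paired_bootstrap_ci(baseline_scores, final_scores)` on two equal-length lists of per-episode scores returns the mean difference and the interval bounds. Because the differences are resampled as pairs, episode-level difficulty cancels out, which is why the final run has to use the same episodes as the baseline.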

Contract: evaluation.md §2.1, §3.1, §3.3, §3.8, §5 EpisodeSetLeakError.

  • EpisodeSetLeakError is raised at entry if baseline.episode_ids ≠ val/briefs.jsonl[0:50], and again at exit if the post-rollout report's IDs diverge from that set.
  • Paired bootstrap CI seed = 20260428 (evaluation.md §2.4).
  • Wall-clock budget: 20 min, the same ceiling as the baseline.
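
The entry and exit guard described in the first bullet could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the local `EpisodeSetLeakError` class stands in for the project's own exception, the JSONL field name `episode_id` is a guess, and the helper names `load_expected_ids` and `assert_same_episode_set` are hypothetical. The comparison is order-sensitive, following the list-slice notation used above.

```python
import json
from pathlib import Path


class EpisodeSetLeakError(RuntimeError):
    """Stand-in for the project's exception (evaluation.md §5); assumed here."""


def load_expected_ids(briefs_path, n=50, id_field="episode_id"):
    """Read the first n episode IDs from a JSONL file.

    The field name 'episode_id' is an assumption, not taken from the source.
    """
    ids = []
    with Path(briefs_path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            ids.append(json.loads(line)[id_field])
            if len(ids) == n:
                break
    return ids


def assert_same_episode_set(expected_ids, actual_ids, stage):
    """Raise if the two ID sequences differ (order-sensitive comparison)."""
    if list(expected_ids) != list(actual_ids):
        raise EpisodeSetLeakError(
            f"{stage}: episode IDs diverge from the expected validation slice"
        )
```

In this sketch the guard would be called twice: once with `baseline.episode_ids` and `stage="entry"` before any rollout, and once with the post-rollout report's IDs and `stage="exit"`.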