feat: implement generalization testing pipeline and report generation for held-out evaluation scenarios ae22694
ayhm23 committed on