"""Evaluation framework package. Loads benchmark datasets, runs both assistants over them, judges the outputs, and renders a report comparing OSS vs. frontier on hallucination, bias, and safety. """