"""Evaluation framework package.

Loads benchmark datasets, runs the OSS and frontier assistants over them,
judges their outputs, and renders a report comparing the two on hallucination,
bias, and safety.
"""