# Evaluation

## Built-in benchmark runner

```python
from src.evaluation.benchmarks import ComprehensiveBenchmarkSuite, BenchmarkConfig

suite = ComprehensiveBenchmarkSuite(BenchmarkConfig())

# Run every registered benchmark against the model and print the summary metrics.
results = suite.run_all_benchmarks(model, datasets)
print(results["summary"])
```

## Practical tips

- Some public benchmarks require specific data fields; dummy or validation sets may be incompatible with them.
- Run evaluations periodically: pass `--eval_frequency 1` during early runs to verify that metrics trend in the right direction (see the sketch after this list).
- Log results to Weights & Biases (W&B) when it is available; the same sketch shows one way to do this.
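
The sketch below ties these tips together: it runs the suite every `--eval_frequency` epochs and mirrors the summary metrics to W&B when the package is installed. How the flag is consumed, the W&B project name, and the shape of `results["summary"]` (assumed here to be a flat dict of scalar metrics) are assumptions; adapt them to your training script.

```python
import argparse

from src.evaluation.benchmarks import ComprehensiveBenchmarkSuite, BenchmarkConfig

# W&B is optional; fall back to console-only logging if it is not installed.
try:
    import wandb
except ImportError:
    wandb = None

parser = argparse.ArgumentParser()
parser.add_argument("--eval_frequency", type=int, default=1,
                    help="Run the benchmark suite every N epochs.")
args = parser.parse_args()

suite = ComprehensiveBenchmarkSuite(BenchmarkConfig())
run = wandb.init(project="eval-runs") if wandb is not None else None  # hypothetical project name

# `num_epochs`, `train_one_epoch`, `model`, and `datasets` are placeholders
# for whatever your training setup provides.
for epoch in range(num_epochs):
    train_one_epoch(model)
    if epoch % args.eval_frequency == 0:
        results = suite.run_all_benchmarks(model, datasets)
        print(results["summary"])
        if run is not None:
            # Prefix keys so benchmark metrics are grouped together in the W&B UI.
            run.log({f"benchmark/{k}": v for k, v in results["summary"].items()},
                    step=epoch)

if run is not None:
    run.finish()
```

Keeping the W&B import optional means the same script works on machines without network access or credentials, at the cost of an extra `if run is not None` guard around each logging call.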