# Evaluation

## Built-in benchmark runner
```python
from src.evaluation.benchmarks import ComprehensiveBenchmarkSuite, BenchmarkConfig

# Run the full suite against a model and its evaluation datasets
suite = ComprehensiveBenchmarkSuite(BenchmarkConfig())
results = suite.run_all_benchmarks(model, datasets)
print(results["summary"])
```
## Practical tips
- Some public benchmarks require specific data fields; dummy/val sets may be incompatible.
- Run evaluations periodically: set `--eval_frequency 1` during early runs to verify trends.
- Log results to W&B when available.
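Nested benchmark results are easiest to log when flattened into scalar metrics first. A minimal sketch, assuming `results` follows the `{"summary": {...}}` shape shown above; the `flatten_metrics` helper and the example key names are illustrative, not part of the library:

```python
def flatten_metrics(d, prefix=""):
    """Recursively flatten a nested result dict into {"a/b": value} pairs,
    the flat scalar form that loggers such as W&B expect."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, name))
        else:
            flat[name] = value
    return flat

# Hypothetical results dict mirroring the suite's {"summary": ...} shape
results = {"summary": {"accuracy": 0.91, "per_task": {"qa": 0.88, "nli": 0.94}}}
metrics = flatten_metrics(results)
print(metrics)  # {"summary/accuracy": 0.91, "summary/per_task/qa": 0.88, ...}
# When a W&B run is active: wandb.log(metrics)
```

The slash-separated key convention groups related metrics into sections in the W&B UI.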