# Evaluation
## Built-in benchmark runner
```python
from src.evaluation.benchmarks import ComprehensiveBenchmarkSuite, BenchmarkConfig

# Build the suite with the default configuration.
suite = ComprehensiveBenchmarkSuite(BenchmarkConfig())

# Run every registered benchmark against the model and collect the results.
results = suite.run_all_benchmarks(model, datasets)
print(results["summary"])
```
## Practical tips
- Some public benchmarks require specific data fields; dummy or validation sets that lack those fields may be incompatible with the runner.
- Run evaluations periodically. During early runs, set `--eval_frequency 1` to verify that metrics trend as expected (see the loop sketch below).
- Log results to W&B when it is available (see the logging sketch after this list).
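The `--eval_frequency` flag controls how often the training script evaluates. The loop below is a minimal sketch of the same pattern in plain Python, reusing `suite`, `model`, and `datasets` from the runner example above; `train_one_epoch`, `train_loader`, and `num_epochs` are hypothetical stand-ins, not part of this repo's API.

```python
num_epochs = 10      # illustrative value
eval_frequency = 1   # mirrors --eval_frequency 1: evaluate after every epoch

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)  # hypothetical training step
    if (epoch + 1) % eval_frequency == 0:
        # Reuse the suite, model, and datasets from the runner example above.
        results = suite.run_all_benchmarks(model, datasets)
        print(f"epoch {epoch + 1}: {results['summary']}")
```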
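When W&B is installed, scalar metrics can be forwarded with `wandb.log`. This is a minimal sketch assuming `results["summary"]` is a flat dict of metric names to values; the project name is illustrative.

```python
import wandb

run = wandb.init(project="benchmark-eval")  # project name is illustrative

# Forward only scalar entries; assumes results["summary"] maps names to values.
for name, value in results["summary"].items():
    if isinstance(value, (int, float)):
        run.log({f"benchmark/{name}": value})

run.finish()
```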