Evaluate model performance on the TimeBench benchmark
Assess text similarity and accuracy scores
Assess model performance over time with automated evaluation
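
As a rough illustration of the accuracy and text-similarity scoring mentioned above, the sketch below computes both over aligned lists of model outputs and reference answers. It is a minimal sketch under stated assumptions, not TimeBench's own scoring: the function names are hypothetical, exact match after trimming and lowercasing stands in for "accuracy", and Python's difflib character-level ratio stands in for "text similarity".

```python
from difflib import SequenceMatcher


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming and lowercasing.

    Assumption: exact match is used as the accuracy criterion.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def mean_similarity(predictions, references):
    """Average character-level similarity ratio (0.0 to 1.0) over aligned pairs.

    Assumption: difflib's SequenceMatcher is a stand-in similarity measure.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    ratios = [
        SequenceMatcher(None, p, r).ratio()
        for p, r in zip(predictions, references)
    ]
    return sum(ratios) / len(ratios)
```

Running both functions on the same prediction and reference lists at each evaluation run, and storing the two numbers with a timestamp, would be one simple way to track scores over time.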