evaluation-guidebook

Clémentine committed on Dec 3, 2025

Commit

fdd0940

1 Parent(s): ffdff5d

mini fix

Files changed (1)

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx

@@ -552,7 +552,7 @@ You can also compute these with prompt variations, by asking the same questions
 ### Cost and efficiency
-When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 20 (because it decides to make an entire segue on binary vs decimal arithmetics) is considerably less efficient than a smol model answering 30 in a handful of tokens.
+When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 1 (because it decides to make an entire segue on binary vs decimal arithmetics) is considerably less efficient than a smol model answering 11 in a handful of tokens.
 <div className="card" style="height: fit-content; max-width: 75%; margin: 40px auto;">
 <img src={envImage.src} alt="Environmental impact metrics for model evaluation" style="height: auto !important; object-fit: contain !important; display: block; margin: 0 auto;" />