---
title: "Intro"
---
import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Note from "../../components/Note.astro";
import Sidenote from "../../components/Sidenote.astro";
import Quote from "../../components/Quote.astro";
## What is model evaluation about?
As you navigate the world of LLMs — whether you're training or fine-tuning your own models, selecting one for your application, or trying to understand the state of the field — there is one question you have likely stumbled upon:
<Quote>
How can one know if a model is *good*?
</Quote>
The answer is (unsurprisingly, given the topic of this blog) evaluation! It's everywhere: leaderboards ranking models, benchmarks claiming to measure *reasoning*, *knowledge*, *coding abilities* or *math performance*, papers announcing new state-of-the-art results...
But what is evaluation, really? And what can it really tell you?
This guide is here to help you understand it all: what evaluation can and cannot do, when to trust different approaches (and what their limitations and biases are!), how to select benchmarks when evaluating a model (and which ones are relevant in 2025), and how to design your own evaluation, if you so wish.
Throughout the guide, we'll also highlight common pitfalls, tips and tricks from the Open Evals team, and hopefully help you learn how to think critically about the claims made from evaluation results.
<Sidenote>In this guide, we focus on evaluations for language (mostly natural language), but many principles also apply to other modalities.</Sidenote>
Before we dive into the details, let's quickly look at why people do evaluation, as who you are and what you are working on will determine which evaluations you need to use.
### The model builder perspective: Am I building a strong model?
If you are a researcher or engineer creating a new model, your goal is likely to build a strong model that performs well on a set of tasks. For a base model (training from scratch), you want the model to do well on general tasks measuring a variety of capabilities. If you are post-training a base model for a specific use case, you probably care more about the performance on that specific task. The way you measure performance, in either case, is through evaluations.
As you experiment with different architectures, data mixtures, and training recipes, you want to make sure that your changes have not "broken" the performance expected of a model with these properties, and have possibly even improved it. The way you test the impact of different design choices is through **ablations**: an ablation is an experiment where you train a model under a specific setup, evaluate it on your chosen set of tasks, and compare the results to a baseline model.
Therefore, the choice of evaluation tasks is critical for ablations, as they determine what you will be optimizing for as you create your model.
<HtmlEmbed src="d3-ablation-workflow.html" title="Ablation example"/>
For base models, you would typically select standard benchmark tasks used by other model builders (think of the classic list of benchmarks that is always reported when a new model is released; we'll look at those below). For a specific use case, you can either reuse existing evaluation tasks if they are available (and you will likely want to take a close look at them if they are not "standard") or design your own (discussed below). As you will likely run a lot of ablations, you want the evaluation tasks to provide strong signal (not just meaningless noisy results) and to run cheaply and quickly, so that you can iterate fast.
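One simple way to sanity-check whether a benchmark gives usable signal is to compare its run-to-run noise against the effect size you hope to detect. The sketch below is purely illustrative (the scores and seed setup are invented, not from any real benchmark):

```python
import statistics

# Hypothetical accuracy scores of the SAME model evaluated with
# different random seeds (few-shot example orderings, etc.).
baseline_runs = [0.412, 0.425, 0.409, 0.418, 0.421]

# Scores of an ablation (e.g. a new data mixture) under the same seeds.
ablation_runs = [0.431, 0.444, 0.428, 0.437, 0.440]

noise = statistics.stdev(baseline_runs)  # run-to-run noise of the benchmark
effect = statistics.mean(ablation_runs) - statistics.mean(baseline_runs)

# Crude rule of thumb: only trust the comparison when the measured
# difference clearly exceeds the benchmark's own noise.
print(f"noise={noise:.4f}, effect={effect:.4f}, "
      f"signal={'yes' if abs(effect) > 2 * noise else 'no'}")
```

If the effect is buried inside the noise band, the benchmark is not discriminative enough for that ablation, no matter how popular it is.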
Through ablations, we are also able to predict the performance of bigger models based on the performance of smaller ones, using scaling laws.
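To make the extrapolation idea concrete, here is a toy sketch: fitting a pure power law `loss ≈ a * N^b` to (made-up) validation losses from small runs, then extrapolating to a larger model. Real scaling-law fits use many more points and usually an irreducible-loss term, so treat this only as an illustration:

```python
import math

# Made-up (parameter count, validation loss) pairs from small ablation runs.
sizes = [1e7, 3e7, 1e8, 3e8]
losses = [4.20, 3.80, 3.44, 3.11]

# Fit loss ≈ a * N**slope by least squares in log-log space.
xs = [math.log(n) for n in sizes]
ys = [math.log(l) for l in losses]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)          # negative: loss falls with size
a = math.exp(my - slope * mx)

# Extrapolate to a 1B-parameter model.
predicted = a * (1e9) ** slope
print(f"predicted loss at 1B params: {predicted:.2f}")
```

The fitted exponent is what lets you budget compute before committing to the big run, which is exactly why small-scale ablations are so valuable.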
Besides running ablations for your experiments, you will likely also want to evaluate intermediate checkpoints as your model trains, to ensure it is properly learning and improving on the different tasks, and does not start regressing due to loss spikes or other issues. Finally, you want to evaluate the final checkpoint so that you can announce that your model is SOTA when you release it.
<Note title="Small caveat">
Despite often grandiose claims, for any complex capability, we cannot at the moment just say "this model is the best at this", but should instead say **"this model is the best on these samples for this specific task that we hope are a good proxy for this capability, without any guarantee"**.
</Note>
(You can still claim you are SOTA, just keep the caveat in mind.)
### The model user perspective: Which model is the best on \<task\>?
You want to use a model someone else trained for your specific use case, without performing additional training, or maybe you will perform additional training and are looking for the best existing model to use as a base.
For common topics like math, code, or knowledge, there are likely several leaderboards comparing and ranking models using different datasets, and you usually just have to test the top contenders to find the best model for you (if those don't work for you, it's unlikely the next-best models will).
You might also want to run the evaluations and comparisons yourself (by reusing existing benchmarks) to get more detailed insight into the models' successes and failures, which we will cover below.
<Sidenote>
In [their paper](https://arxiv.org/pdf/2404.02112) about lessons learned on benchmarking and dataset design from the ImageNet era, the authors argue that, since scores are susceptible to instability, the only robust way to evaluate models is through rankings, and more specifically by finding broad groups of evaluations which provide consistent and stable rankings. I believe looking for ranking stability is indeed an extremely interesting approach to model benchmarking: we have shown that LLM *scores* on automated benchmarks are extremely susceptible to [minute changes in prompting](https://huggingface.co/blog/evaluation-structured-outputs), and that human evaluations are not more consistent, whereas *rankings* are more stable when using robust evaluation methods.
</Sidenote>
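To make the ranking-stability idea concrete, here is a toy sketch (the model names and scores are invented) of a Kendall-tau-style pairwise agreement check between the rankings two benchmarks induce over the same models:

```python
from itertools import combinations

# Invented scores for the same four models on two hypothetical benchmarks.
bench_a = {"model-1": 71.2, "model-2": 68.9, "model-3": 66.4, "model-4": 61.0}
bench_b = {"model-1": 55.3, "model-2": 54.8, "model-3": 49.1, "model-4": 50.2}

def rank_agreement(scores_a, scores_b):
    """Fraction of model pairs that both benchmarks order the same way."""
    pairs = list(combinations(scores_a, 2))
    agree = sum(
        (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n]) > 0
        for m, n in pairs
    )
    return agree / len(pairs)

print(rank_agreement(bench_a, bench_b))  # 5 of 6 pairs agree here
```

An agreement near 1.0 across a group of benchmarks suggests they rank models consistently; raw scores, by contrast, can shift with tiny prompting changes while leaving such rankings intact.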
Similarly to model builders hill-climbing on a specific capability, for less common topics you might need to think about designing your own evaluations, which is detailed in our last section.
<Note title="Takeaways" emoji="🎯" variant="info">
- Model builder: You need fast, high-signal benchmarks that cover the domains/capabilities you care about and can be run repeatedly during ablations.
- Model user: You need benchmarks that match your specific use case, even if that means creating custom ones.
</Note>
<Note title="What about measuring AGI?">
We sorely lack good definitions of, and frameworks for, what intelligence means for machine learning models, and how to evaluate it (though some have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). Difficulty in defining intelligence is not a problem specific to machine learning! In human and animal studies, it is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ, for example) are hotly debated and controversial, with good reason.
There are, however, some issues with focusing on intelligence as a target. 1) Intelligence tends to end up being a moving target: any time models reach a capability thought to be human-specific, we redefine the term. 2) Our current frameworks were made with humans (or animals) in mind, and will most likely not transfer well to models, as the underlying behaviors and assumptions are not the same. 3) It is kind of a useless target too: we should aim to make models good at specific, well-defined, purposeful and useful tasks (think accounting, reporting, etc.) instead of aiming for AGI for the sake of it.
</Note>