---
title: "The LLM Evaluation Guidebook"
subtitle: "All the things you could want to know about LLM evaluation, based on our experience scoring 15,000 models over 3 years"
description: "Understanding the tips and tricks of evaluating an LLM in 2025"
authors:
- name: "Clémentine Fourrier"
url: "https://huggingface.co/clefourrier"
affiliations: [1]
- name: "Thibaud Frere"
url: "https://huggingface.co/tfrere"
affiliations: [1]
- name: "Guilherme Penedo"
url: "https://huggingface.co/guipenedo"
affiliations: [1]
- name: "Thomas Wolf"
url: "https://huggingface.co/thomwolf"
affiliations: [1]
affiliations:
- name: "Hugging Face"
url: "https://huggingface.co"
published: "Dec. 03, 2025"
tags:
- research
- evaluation
tableOfContentsAutoCollapse: true
---
import Note from "../components/Note.astro";
import Sidenote from "../components/Sidenote.astro";
import HtmlEmbed from "../components/HtmlEmbed.astro";
import Intro from "./chapters/intro.mdx";
import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx";
import PickingYourEval from "./chapters/general-knowledge/picking-your-evaluation.mdx";
import EvalsIn2025 from "./chapters/general-knowledge/2025-evaluations-for-useful-models.mdx";
import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
<Intro />
## LLM basics to understand evaluation
Now that you have an idea of why evaluation matters to different people, let's look at how we prompt models to get answers out of them for evaluation. If you have already done evaluation, feel free to skim this section and focus mainly on the notes and sidenotes.
<ModelInferenceAndEvaluation />
## Evaluating with existing benchmarks
Now that you've gotten (re)acquainted with the required basics of how tokenization and inference work, and with the caveats of doing evaluation, let's look at actual benchmarking! We'll first take a small tour of 2025 evaluations, then discuss what to look for in a benchmark, and why you probably can't reproduce announced scores. Lastly, we'll cover the special case of selecting good benchmarks to evaluate training, with the FineWeb team.
<Note title="Important concepts" emoji="⚠️" variant="info">
In this section, you'll see two concepts mentioned quite a lot: contamination and saturation.
**Saturation** is when model performance on a benchmark surpasses human performance. More generally, the term is used for datasets that are no longer considered useful because they have lost their discriminative power between models.
<Sidenote> It's what you observe in the banner picture! </Sidenote>
*If all models have close to the highest possible score on your evaluation, it's no longer a discriminative benchmark. It's similar to evaluating high school students on pre-school problems: success tells you nothing (though failure is indicative).*
**Contamination** is when an evaluation dataset ends up in models' training data, in which case model performance is artificially inflated and does not reflect real-world performance on the task.
*It's a bit like evaluating students on questions they already know in advance.*
</Note>
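To make contamination more concrete, here is a minimal sketch of one common detection heuristic: flagging evaluation samples whose word n-grams also appear in the training corpus. The function names, the n-gram window, and the single-match threshold below are illustrative choices, not a standard API.

```python
# Minimal sketch of a common contamination heuristic: flag an evaluation
# sample if any of its word n-grams also appears in the training corpus.
# All names and thresholds here are illustrative, not a standard API.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lowercased word n-grams; 13-grams are a commonly used window size."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_train_index(train_docs: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """Index every n-gram seen in training data (in-memory: small corpora only)."""
    index: set[tuple[str, ...]] = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(eval_sample: str, train_index: set, n: int = 13) -> bool:
    """A single shared n-gram is enough to flag the sample for manual review."""
    return not ngrams(eval_sample, n).isdisjoint(train_index)

# Toy usage (short n-grams so the example fits in one line)
train_index = build_train_index(["the quick brown fox jumps over the lazy dog"], n=5)
print(is_contaminated("we see the quick brown fox jumps over it", train_index, n=5))  # True
```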
### Benchmarks to know in 2025
<EvalsIn2025 />
### Understanding what's in there
No matter how you selected your initial datasets, the most important step is, and always will be, to look at the data: what you have, what the model generates, and the scores it gets. In the end, that's the only way to see whether your evaluations are actually relevant for your specific use case.
You want to study the following.
#### Data creation process
- **Who created the actual samples?**
Ideally, you want a dataset created by experts; the next tier is paid annotators, then crowdsourcing, then synthetic data, then MTurk. You also want to look for a data card, where you'll find annotator demographics - this can be important for understanding the dataset's language diversity or potential cultural bias.
- **Were they all examined by other annotators or by the authors?**
You want to know whether the inter-annotator agreement on samples is high (i.e., do annotators agree with each other?) and/or whether the full dataset has been examined by the authors - a quick way to quantify agreement is sketched after this list.
This is especially important for datasets created with the help of underpaid annotators, who are usually not native speakers of your target language (think AWS Mechanical Turk), as you might otherwise find typos, grammatical errors, or nonsensical answers.
- **Were the annotators provided with clear data creation guidelines?**
In other words, is your dataset consistent?
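If you have access to the raw annotations, agreement is easy to quantify. Below is a minimal sketch using Cohen's kappa from scikit-learn, which corrects raw agreement for agreement expected by chance; the annotator labels are made up for illustration.

```python
# Sketch: quantify inter-annotator agreement with Cohen's kappa.
# Kappa corrects raw agreement for chance; values above ~0.8 are
# usually read as strong agreement.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same 10 samples
annotator_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
annotator_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```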
#### Samples inspection
Take 50 random samples and manually inspect them - and I mean do it yourself, not "prompt an LLM to find unusual stuff in the data for you".
First, you want to check the content quality.
- Are the prompts clear and unambiguous?
- Are the answers correct? (*Eg: TriviaQA contains several gold answers (aliases field) per question, sometimes conflicting.*)
- Is information missing? (*Eg: MMLU references absent schematics in a number of questions.*)
It's important to keep in mind that a dataset being a standard does not make it a good one - and flawed standards persist precisely because most people skip this step.
Then, you want to check for relevance to your task. Are these questions the kind of questions you want to evaluate an LLM on? Are these examples relevant to your use case?
You might also want to check the samples' consistency (especially if you're planning on using few-shot examples or computing aggregated statistics): do all samples have the same number of choices if it's a multiple-choice evaluation? Is the spacing consistent before and after the prompt? A small audit of this kind is sketched below. If your evaluation comes with an additional environment, ideally you want to use it to understand what gets called.
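Here is a minimal sketch of such a consistency audit, assuming a multiple-choice dataset with hypothetical `question` and `choices` fields; adapt the field names to your schema.

```python
# Sketch of a consistency audit for a multiple-choice dataset.
# The "question"/"choices" field names are hypothetical; adapt to your schema.
from collections import Counter

samples = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"]},
    {"question": " 3 + 3 = ?", "choices": ["6", "7", "8"]},  # leading space + only 3 choices
]

# Do all samples have the same number of choices?
choice_counts = Counter(len(s["choices"]) for s in samples)
print("choice counts:", choice_counts)  # more than one key = inconsistent

# Is the whitespace around prompts consistent?
stray_whitespace = [s["question"] for s in samples if s["question"] != s["question"].strip()]
print("samples with stray leading/trailing whitespace:", len(stray_whitespace))
```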
Lastly, you also want to quickly check how many samples are present (to make sure results are statistically significant - 100 samples is usually a minimum for automatic benchmarks).
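To see why roughly 100 samples is a floor rather than a nicety, here is a back-of-the-envelope computation of the standard error of an accuracy estimate, using the usual binomial formula (the 0.70 accuracy is just an example value).

```python
# Back-of-the-envelope: standard error of an accuracy estimate,
# using the binomial formula sqrt(p * (1 - p) / n).
import math

def accuracy_stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

for n in (30, 100, 1000):
    se = accuracy_stderr(0.7, n)
    # A ~95% confidence interval is roughly +/- 2 standard errors
    print(f"n={n:5d}  acc=0.70 +/- {2 * se:.3f}")
# At 30 samples the interval is about +/- 17 points; at 100, +/- 9; at 1000, +/- 3.
```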
In the viewer below, for example, you can inspect the first samples of well-known post-training benchmarks, collected by Lewis.
<iframe
src="https://huggingface.co/datasets/HuggingFaceTB/post-training-benchmarks-viewer/embed/viewer/aime25/test"
frameborder="0"
width="100%"
height="560px"
></iframe>
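If you'd rather poke at the same samples in code, the dataset behind this viewer can be pulled with the `datasets` library - the repository id, config, and split below are read directly off the embed URL above.

```python
# Load the same benchmark samples shown in the viewer above.
# Repo id, config ("aime25") and split ("test") are taken from the embed URL.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/post-training-benchmarks-viewer", "aime25", split="test")
print(ds)     # column names and number of rows
print(ds[0])  # first sample, for manual inspection
```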
#### Task and metrics
You want to check which metrics are used: are they automatic, functional, or relying on a model judge? The answer will change the cost of running evaluations for you, as well as their reproducibility and bias type. The best (but rarest) metrics are functional or based on rule-based verifiers. <Sidenote> When doing code evals, beware of too-easy pass/fail unit tests! Recent LLMs have become very good at overwriting globals to 'cheat', especially in languages like Python where you can mess with variable scope.</Sidenote>
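To make "rule-based verifier" concrete, here is a minimal sketch of a numeric-answer checker: it extracts the last number from the model output and compares it to the gold answer. Real verifiers for math benchmarks handle many more formats; everything below is an illustrative sketch, not a production implementation.

```python
# Minimal sketch of a rule-based verifier for numeric answers:
# extract the last number in the model output and compare it to the gold answer.
import re

def extract_last_number(text: str) -> float | None:
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def verify(model_output: str, gold: float, tol: float = 1e-6) -> bool:
    pred = extract_last_number(model_output)
    return pred is not None and abs(pred - gold) < tol

print(verify("Let's see... 12 * 4 = 48, so the answer is 48.", 48.0))  # True
print(verify("The answer is forty-eight.", 48.0))                      # False: no digits found
```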
### So, you can't reproduce reported model scores?
<TroubleshootingReproducibility />
### Selecting good benchmarks automatically for model training
<PickingYourEval />
## Creating your own evaluation
At this stage, you likely have a good idea of why people do evaluation, which benchmarks exist and are relevant for different model stages (training, inference of base and tuned models), but what if nothing exists for your specific use case?
This is precisely when you could want to create your own evaluation.
<DesigningAutomaticEvaluation />
## Conclusion
Evaluation is both an art and a science. We've explored the landscape of LLM evaluation in 2025: from understanding why we evaluate models and the fundamental mechanics of tokenization and inference, to navigating the ever-evolving ecosystem of benchmarks, and finally to creating evaluations for your own use cases.
Key things I hope you'll remember are:
**Think critically about what you're measuring.** Evaluations are proxies for capabilities, so a high score on a benchmark doesn't guarantee real-world performance. Different evaluation approaches (automatic metrics, human judges, or model judges) each come with their own biases, limitations, and tradeoffs.
**Match your evaluation to your goal.** Are you running ablations during training? Use fast, reliable benchmarks with strong signal even on small models. Comparing final models for selection? Focus on harder, uncontaminated datasets that test holistic capabilities. Building for a specific use case? Create custom evaluations that reflect your problems and data.
**Reproducibility requires attention to detail.** Small differences in prompts, tokenization, normalization, templates, or random seeds can swing scores by several points. When reporting results, be transparent about your methodology. When trying to reproduce results, expect that exact replication will be extremely challenging even if you attempt to control for every variable.
**Prefer interpretable evaluation methods.** When possible, functional testing and rule-based verifiers should be chosen over model judges. Evaluations that can be understood and debugged will provide clearer and more actionable insights... and the more interpretable your evaluation, the more you can improve your models!
**Evaluation is never finished.** As models improve, benchmarks saturate. As training data grows, contamination becomes more likely. As use cases evolve, new capabilities need measuring. Evaluation is an ongoing battle!
To conclude: The models we build are only as good as our ability to measure what matters. Thanks for reading!
### Acknowledgments
Many thanks to all the people who contributed directly or indirectly to this document, notably Hynek Kydlicek, Loubna Ben Allal, Sander Land and Nathan Habib.