|
|
--- |
|
|
title: "The LLM Evaluation Guidebook" |
|
|
subtitle: "All the things you could want to know about LLM evaluation based on our experience scoring 15,000 models over 3 years"
|
|
description: "Understanding the tips and tricks of evaluating an LLM in 2025" |
|
|
authors: |
|
|
- name: "Clémentine Fourrier" |
|
|
url: "https://huggingface.co/clefourrier" |
|
|
affiliations: [1] |
|
|
- name: "Thibaud Frere" |
|
|
url: "https://huggingface.co/tfrere" |
|
|
affiliations: [1] |
|
|
- name: "Guilherme Penedo" |
|
|
url: "https://huggingface.co/guipenedo" |
|
|
affiliations: [1] |
|
|
- name: "Thomas Wolf" |
|
|
url: "https://huggingface.co/thomwolf" |
|
|
affiliations: [1] |
|
|
affiliations: |
|
|
- name: "Hugging Face" |
|
|
url: "https://huggingface.co" |
|
|
published: "Dec. 03, 2025" |
|
|
tags: |
|
|
- research |
|
|
- evaluation |
|
|
tableOfContentsAutoCollapse: true |
|
|
--- |
|
|
|
|
|
import Note from "../components/Note.astro"; |
|
|
import Sidenote from "../components/Sidenote.astro"; |
|
|
import HtmlEmbed from "../components/HtmlEmbed.astro"; |
|
|
|
|
|
import Intro from "./chapters/intro.mdx"; |
|
|
import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx"; |
|
|
import PickingYourEval from "./chapters/general-knowledge/picking-your-evaluation.mdx"; |
|
|
import EvalsIn2025 from "./chapters/general-knowledge/2025-evaluations-for-useful-models.mdx";
|
|
import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx"; |
|
|
import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx"; |
|
|
|
|
|
<Intro /> |
|
|
|
|
|
## LLM basics to understand evaluation |
|
|
|
|
|
Now that you have an idea of why evaluation is important to different people, let's start with the basics: how an LLM actually generates text, and how that shapes the way we evaluate it.
|
|
|
|
|
|
|
|
<ModelInferenceAndEvaluation /> |
|
|
|
|
|
## Evaluating with existing benchmarks |
|
|
|
|
|
Now that you understand how models run inference and get scored, let's look at how to evaluate them with existing benchmarks.
|
|
|
|
|
<Note title="Important concepts" emoji="⚠️" variant="info"> |
|
|
In this section, you'll need two important concepts:
|
|
|
|
|
**Saturation** is when model performance on a benchmark passes human performance. More generally, the term is used for datasets that are no longer considered useful, as they have lost discriminative power between models. |
|
|
<Sidenote> It's worth remembering that a benchmark can also look saturated because of label noise: if some gold answers are wrong, no model can score above the ceiling those errors impose. </Sidenote>
|
|
|
|
|
*If all models have close to the highest possible score on your evaluation, it's saturated: it can no longer tell them apart.*
|
|
|
|
|
**Contamination** is when an evaluation dataset ended up in the training dataset of models, in which case the performance of models is artificially inflated, and does not reflect real world performance on the task. |
|
|
|
|
|
*It's one of the main reasons to be wary of very high scores on older public benchmarks, and to prefer recent or held-out datasets when comparing models.*
|
|
|
|
|
</Note> |
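As a concrete illustration of contamination checking, here is a minimal sketch based on word-level n-gram overlap (13-grams are a common heuristic choice; `train_texts` and `eval_samples` are hypothetical placeholders for your own corpora, and real pipelines use far more scalable approaches):

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Word-level n-grams, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(eval_samples: list[str], train_texts: list[str], n: int = 13) -> float:
    """Fraction of eval samples sharing at least one n-gram with the training data."""
    train_grams = set()
    for t in train_texts:
        train_grams |= ngrams(t, n)
    flagged = sum(1 for s in eval_samples if ngrams(s, n) & train_grams)
    return flagged / max(len(eval_samples), 1)
```

A non-zero rate doesn't prove the scores are meaningless, but it tells you which samples to discard or examine before trusting a comparison.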
|
|
|
|
|
### Benchmarks to know in 2025 |
|
|
|
|
|
<EvalsIn2025 /> |
|
|
|
|
|
|
|
|
### Understanding what's inside a benchmark
|
|
|
|
|
No matter how you selected your initial datasets, the most important step is, and always will be, to look at the data: the samples you have, what the model generates, and the scores they get. In the end, that's the only way to know whether an evaluation actually measures what you care about.
|
|
|
|
|
You want to study the following:
|
|
|
|
|
#### Data creation process |
|
|
- **Who created the actual samples?** |
|
|
Ideally, you want a dataset created by experts; the next best tiers are paid annotators, then crowdsourced, then synthetic, then MTurked data. You also want to look for a data card, where you'll find how the dataset was built and what each field contains.
|
|
|
|
|
- **Were they all examined by other annotators or by the authors?** |
|
|
You want to know if the inter-annotator score on samples is high (= are annotators in agreement?) and/or if the full dataset has been examined by the authors. |
|
|
This is especially important for datasets created with the help of underpaid annotators who are usually not native speakers of your target language (think AWS Mechanical Turk), as you might otherwise find typos, grammatical errors, or nonsensical answers.
|
|
|
|
|
- **Were the annotators provided with clear data creation guidelines?** |
|
|
In other words, is your dataset consistent? |
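The inter-annotator agreement mentioned above can be quantified; Cohen's kappa (agreement between two annotators, corrected for chance) is a standard choice. A minimal sketch, assuming each annotator's labels come as a parallel list:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa between two annotators: observed agreement corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators independently pick the same label
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means annotators genuinely agree; a kappa near 0 means their agreement is no better than chance, which should make you doubt the gold labels.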
|
|
|
|
|
#### Samples inspection |
|
|
Take 50 random samples and manually inspect them; and I mean do it yourself, not "prompt an LLM to find unusual stuff in the data for you". |
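To make that review reproducible (so several reviewers read the same samples, or you can revisit them later), draw the subset with a fixed seed. A minimal sketch, assuming the benchmark has been loaded as a list of dicts (e.g. via the `datasets` library), with hypothetical field names:

```python
import random

def sample_for_review(dataset: list[dict], k: int = 50, seed: int = 0) -> list[dict]:
    """Draw a fixed, reproducible random subset for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(k, len(dataset)))

# Hypothetical usage: print each pair and actually read them yourself.
# for ex in sample_for_review(benchmark):
#     print(ex["question"], "->", ex["answer"])
```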
|
|
|
|
|
First, you want to check the content quality. |
|
|
- Are the prompts clear and unambiguous? |
|
|
- Are the answers correct? (*Eg: TriviaQA contains several gold answers (aliases field) per question, sometimes conflicting.*) |
|
|
- Is information missing? (*Eg: MMLU references absent schematics in a number of questions.*) |
|
|
|
|
|
It's fine to find a few imperfect samples, but if a sizable share is ambiguous or wrong, scores on the benchmark won't mean much.
|
|
|
|
|
Then, you want to check for relevance to your task. Are these questions the kind of questions you want to evaluate an LLM on? Are these examples relevant to your use case? |
|
|
|
|
|
You might also want to check the samples' consistency (especially if you plan to evaluate in a few-shot setting): are all samples formatted the same way?
|
|
|
|
|
Lastly, you also want to quickly check how many samples the dataset contains (to make sure results are statistically significant; 100 samples is usually a minimum for automatic benchmarks).
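To see why ~100 samples is a floor: the uncertainty on an accuracy estimate shrinks with the square root of the sample count. A back-of-the-envelope sketch using the normal approximation to the binomial (an assumption that holds reasonably for accuracies away from 0 and 1):

```python
import math

def accuracy_ci_halfwidth(accuracy: float, n_samples: int, z: float = 1.96) -> float:
    """Half-width of the ~95% confidence interval around a measured accuracy."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_samples)

# With 100 samples, a measured 70% accuracy is really 70% +/- ~9 points;
# reliably separating two models 2 points apart needs thousands of samples.
```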
|
|
|
|
|
In the viewer below, for example, you can inspect the first samples of well-known post-training benchmarks, collected by Lewis.
|
|
|
|
|
<iframe |
|
|
src="https://huggingface.co/datasets/HuggingFaceTB/post-training-benchmarks-viewer/embed/viewer/aime25/test" |
|
|
frameborder="0" |
|
|
width="100%" |
|
|
height="560px" |
|
|
></iframe> |
|
|
|
|
|
#### Task and metrics |
|
|
|
|
|
You want to check what metrics are used: are they automatic, functional, or relying on a model judge? The answer will change the cost of running evaluations for you, as well as their reproducibility and bias type. The best (but rarest) metrics are functional ones or those based on rule-based verifiers. <Sidenote> When doing code evals, beware of too-easy pass/fail unit tests! Recent LLMs have become very good at overwriting globals to make the tests pass without actually solving the task. </Sidenote>
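For illustration, here is a minimal rule-based verifier for numeric answers: normalize both strings before comparing, since raw exact match would penalize harmless formatting differences. This is a sketch (the normalization rules and prefixes are illustrative assumptions, not an exhaustive production verifier):

```python
import re

def normalize_answer(answer: str) -> str:
    """Strip boilerplate prefixes, currency signs, thousands separators, trailing periods."""
    answer = answer.strip().lower()
    answer = re.sub(r"^(the answer is|answer:)\s*", "", answer)
    answer = answer.replace(",", "").replace("$", "").rstrip(".")
    return answer

def verify(prediction: str, gold: str) -> bool:
    """Rule-based check: compare normalized answers, numerically when both parse as numbers."""
    p, g = normalize_answer(prediction), normalize_answer(gold)
    try:
        return float(p) == float(g)
    except ValueError:
        return p == g
```

The appeal of this kind of verifier is that every failure is debuggable: you can look at the normalized strings and see exactly why a sample was marked wrong, which is not true of a model judge.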
|
|
|
|
|
### So, you can't reproduce reported results?
|
|
|
|
|
<TroubleshootingReproducibility /> |
|
|
|
|
|
### Selecting good benchmarks automatically for model training |
|
|
|
|
|
<PickingYourEval /> |
|
|
|
|
|
|
|
|
|
|
|
## Creating your own evaluation |
|
|
|
|
|
At this stage, you likely have a good idea of why people do evaluation, which benchmarks exist and are relevant for different model stages (training, inference of base and tuned models), but what if nothing exists for your specific use case? |
|
|
|
|
|
This is precisely when you could want to create your own evaluation. |
|
|
|
|
|
<DesigningAutomaticEvaluation /> |
|
|
|
|
|
## Conclusion |
|
|
|
|
|
Evaluation is both an art and a science. We've covered a lot of ground, from how models generate text to picking, debugging, and designing benchmarks.
|
|
|
|
|
Key things I hope you'll take away:
|
|
|
|
|
**Think critically about what you're measuring.** A score only means something if you understand the dataset, the task, and the metric behind it. Look at the data, always.
|
|
|
|
|
**Match your evaluation to your goal.** Are you running ablations during training? Use fast, reliable benchmarks with strong signal even on small models. Comparing final models for selection? Focus on harder, uncontaminated datasets that test holistic capabilities. Building for a specific use case? Create custom evaluations that reflect your problems and data. |
|
|
|
|
|
**Reproducibility requires attention to detail.** Small differences in prompts, tokenization, normalization, templates, or random seeds can swing scores by several points. When reporting results, be transparent about your methodology. When trying to reproduce results, expect that exact replication will be extremely challenging even if you attempt to control for every variable. |
|
|
|
|
|
**Prefer interpretable evaluation methods.** When possible, functional testing and rule-based verifiers should be chosen over model judges. Evaluations that can be understood and debugged will provide clearer and more actionable insights... and the more interpretable your evaluation, the more you can improve your models! |
|
|
|
|
|
**Evaluation is never finished.** As models improve, benchmarks saturate. As training data grows, contamination becomes more likely. As use cases evolve, new capabilities need measuring. Evaluation is an ongoing battle! |
|
|
|
|
|
To conclude: The models we build are only as good as our ability to measure what matters. Thanks for reading! |
|
|
|
|
|
|
|
|
### Acknowledgments |
|
|
|
|
|
Many thanks to all the people who contributed directly or indirectly to this document, notably Hynek Kydlicek, Loubna Ben Allal, Sander Land and Nathan Habib. |