---
title: "Designing your automatic evaluation"
---
import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";
import HtmlEmbed from "../../../components/HtmlEmbed.astro";
import Image from "../../../components/Image.astro";
import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
import envImage from '../../assets/image/env.png';
import Wide from "../../../components/Wide.astro";
### Dataset
#### Using existing data
You can use existing datasets as is, and change the associated prompting or metrics (as has been done to adapt older evaluations to new prompting methods), but you can also aggregate datasets.
Dataset aggregation is a good approach when you want to evaluate a specific capability that isn't well covered by a single benchmark. Rather than starting from scratch, you can combine samples from multiple existing datasets to create a targeted evaluation suite. That's for example what the authors of the "Measuring AGI" paper did recently to create a new "AGI evaluation" dataset.
When aggregating datasets, pay attention to whether
- they contain redundant data (most mathematics datasets are rewrites or aggregations of the same initial problems)
- you need balanced representation across sources (you might not want one dataset to dominate and skew your evaluation) - this will also determine whether to aggregate scores across all samples or per subset
- formats and difficulty levels are compatible (typically, if creating a unified dataset, beware of mixing samples that require sampling with samples that don't).
<Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
New research by EpochAI (2025) showcases how to [best aggregate benchmarks together under a single framework](https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks) to make the aggregated dataset harder overall and less prone to saturation.
<UsingHumanAnnotators />
#### Creating a dataset synthetically
**Using rule-based techniques**
If your task allows, using procedurally generated benchmarks is a very good way to get a virtually infinite supply of samples and avoid contamination! Since fresh test cases are generated algorithmically, you can control difficulty and enable automatic verification, while ensuring models haven't seen the examples during training.
For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), [ZebraLogic](https://arxiv.org/pdf/2502.01100), IFEval, or GSM-Symbolic among others. **NPHardEval** generates complexity-grounded tasks like graph problems with automatic verification and monthly refreshes to reduce overfitting. **MuSR** creates complex reasoning instances like 1000-word murder mysteries using neurosymbolic generation. **ZebraLogic** algorithmically produces logic grid puzzles by generating solutions and iteratively minimizing clues using SAT solvers. **BabiQA** simulates entities following successions of actions. **IFEval** tests instruction-following with 500+ prompts containing verifiable constraints like word counts that can be checked programmatically. **GSM-Symbolic** uses templates to generate diverse math questions.
Tasks which usually fit this paradigm test mathematical, logical, or coding abilities.
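To make this concrete, here's a minimal sketch of template-based generation in the spirit of GSM-Symbolic (the template, names, and numbers are all made up for illustration): every sample is minted fresh, and the ground truth is known by construction.

```python
import random

# A hypothetical GSM-Symbolic-style template: names and numbers are resampled
# on every call, and the ground truth is computed programmatically.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have?"

def generate_sample(rng: random.Random) -> dict:
    name = rng.choice(["Ava", "Noah", "Mia", "Liam"])
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    return {
        "question": TEMPLATE.format(name=name, a=a, b=b),
        "answer": a + b,  # automatic verification: the answer is known by construction
    }

rng = random.Random(42)  # fixed seed so each benchmark version is reproducible
dataset = [generate_sample(rng) for _ in range(1000)]
```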
**Creating synthetic data with models**
If you want to create synthetic data, you usually start from a number of seed documents that will act as your ground truth. These can be internal and specific to your use cases, or available on the web and of high quality (like Wikipedia, Stack Overflow, ...). You'll then likely need to chunk your data into units of self-contained meaning.
You'll then likely want a model to design questions from your data. For this, you will need to select a frontier model, and design a very good prompt asking the model to create use-case relevant questions from the provided data. It's better if you ask the model to provide the source on which it based its question.
You can also use seed prompts as examples to provide to an external model for it to write the prompt for your model to generate new questions, if you want to go full synthetic ^^
Once this is done, you can run an automatic validation pass by using a model from a different family line as a judge over your ground truth + questions + answers.
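As an illustration, a minimal sketch of the generation step could look like this, assuming an OpenAI-compatible client; the model name and prompt wording are placeholders to adapt to your use case.

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works here

# Illustrative generation prompt: asking for the source passage makes
# validation and spot-checking much easier later on.
GEN_PROMPT = (
    "You write evaluation questions. From the document below, write one "
    "self-contained question that can be answered using only this document, "
    "the expected answer, and the exact source passage you based it on.\n\n"
    "Document:\n{chunk}"
)

def generate_question(chunk: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",  # hypothetical frontier model; judge with a different family
        messages=[{"role": "user", "content": GEN_PROMPT.format(chunk=chunk)}],
    )
    return response.choices[0].message.content
```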
<Note title="Always make sure that you're checking your data" emoji="⚠️" variant="warning">
No matter how tempting it is to do everything automatically, you should always check your data at every step, to make sure your evaluations are of high quality. Evaluation is the name of the game and you need to use extremely good data.
</Note>
#### Managing contamination
In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
Solutions to mitigate this include:
- providing a **canary string** in the evaluation set (like in [BigBench](https://github.com/google/BIG-bench)): it is a specific character combination that model creators can look for in their training sets, and which indicates that a document contains evaluation data (a toy sketch follows this list)
- providing evaluation sets in **[encrypted](https://arxiv.org/abs/2309.16575) or [gated](https://huggingface.co/datasets/Idavidrein/gpqa)** forms so that they can't be parsed easily by web crawlers - therefore not ending up accidentally in training sets
- running [dynamic benchmarks](https://arxiv.org/abs/2104.14337): benchmarks regularly updated through time so that models can't "learn the answers by heart" (but it makes datasets more costly)
- if you are running a benchmark, trying to [detect contamination](https://arxiv.org/abs/2311.06233) post-hoc (for example, by looking at the generation perplexity or designing adversarial versions of the prompts - however, no method is a foolproof contamination detection method)
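As a toy illustration of the canary string idea (the GUID below is made up, not BigBench's actual canary):

```python
# Hypothetical canary GUID; real benchmarks publish their own unique string.
CANARY = "EVAL-CANARY-d2c9a8f4-0000-4000-8000-deadbeef0000"

def filter_contaminated(corpus: list[str]) -> list[str]:
    # Model trainers can grep their training shards for the canary
    # and drop any document containing it.
    return [doc for doc in corpus if CANARY not in doc]
```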
However, a contaminated dataset can still be interesting and provide signal during training, as we saw in the ablations section.
<Note>
A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. In less extreme cases, you still want to test if your model is able to generalize to data patterns which were not in the training set's distribution (for example, classify toxicity on stack overflow after having seen only toxicity on reddit).
</Note>
### Choosing a prompt
The prompt is going to define how much information is given to your model about the task, and how this information is presented to the model. It usually contains the following parts: an optional **task prompt** which introduces the task, and the format that the output should follow, **attached context** if needed (for example a source, an image), a **problem prompt** which is what you ask of the model, and optional options for multiple choice evaluations.
When defining your prompt, you need to be aware that even small changes in semantically equivalent prompts can make the results vary by quite a lot, and prompt formats might advantage or disadvantage specific models (See [this section](https://huggingface.co/spaces/OpenEvals/evaluation-guidebook#different-prompt)).
➡️ This can be mitigated by re-running the evaluation several times with prompt variations (but it can be costly), or simply running your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty.
➡️ You can also provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall.
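To make the prompt anatomy from above concrete, here's one illustrative way to assemble the different parts for an MCQA sample (the wording is made up):

```python
# Assembling the prompt parts described above: task prompt, attached context,
# problem prompt, and options.
task_prompt = "The following is a multiple choice question. Answer with a single letter."
context = "Passage: The Eiffel Tower was completed in 1889."  # attached context
question = "When was the Eiffel Tower completed?"             # problem prompt
options = ["A. 1887", "B. 1889", "C. 1901", "D. 1925"]

prompt = "\n".join([task_prompt, "", context, "", f"Question: {question}", *options, "Answer:"])
```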
### Choosing an inference method for your model
You'll need to choose which kind of inference method to use.
<Note title="Reminder about loglikelihood evaluations">
Using log-probabilities is good for multiple-choice question answering (MCQA), to test model knowledge or ability to disambiguate.
- Pros:
- Makes sure that all models have access to the correct answer
- Provides a proxy for model "confidence" (and calibration)
- Fast to evaluate, especially when the model only needs to predict one token (A/B/C/D as choice indices, Yes/No, etc.)
- Allows you to get signal on small models' task performance
- Cons:
- Slightly over-scores small models which would have generated something outside of the range of available choices if given free rein.
- Some models [favor specific choices based on the order in which they have been presented](https://arxiv.org/abs/2309.03882), which could lead to unrepresentative evaluations (unless you're re-running the evaluation n times by shuffling samples orders, which you should do for significance if you have the budget for!)
</Note>
<Note title="Tip: an easy speed up for MCQA evaluations" emoji="💡">
You can speed up your MCQA predictions by a lot if you make sure your model needs to predict only one token for the task.
This way, instead of running your `number_of_choices` predictions (`context + choice 1`, `context + choice 2`, etc), you can simply run inference on `context` and compute the probability distribution on the full vocabulary (which will include all your one token choices) to get your logprobabilities of interest, and do this step in one pass.
</Note>
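Here's a sketch of this trick using `transformers` (the model name is an arbitrary example): one forward pass, then we read the log-probabilities of the choice letters off the last position.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M"  # arbitrary example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

context = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token distribution, one pass

# Each choice must tokenize to a single token (note the leading space)
choice_ids = [tokenizer.encode(f" {c}", add_special_tokens=False)[0] for c in "ABCD"]
choice_logprobs = torch.log_softmax(logits, dim=-1)[choice_ids]
prediction = "ABCD"[choice_logprobs.argmax().item()]
```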
<Note title="Reminder about generative evaluations">
Nowadays most evaluations are generative: using generations is very good for any task where you want to test fluency, reasoning, or the ability of your model to actually answer questions. It's also the most relevant way to evaluate reasoning models.
- Pros:
- Actually correlates with the LLM's ability to generate fluent text, which is most of the time what people are actually interested in
- The only way to evaluate both closed and open source models
- Cons:
- Can be harder to score (see below)
- More expensive than log likelihood evaluations, especially if they include sampling or reasoning models
</Note>
### Scoring
If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want a variant of accuracy (how often the most likely choice is the correct one). It's important to normalize the log-probabilities, by sequence length (in characters or tokens) or with PMI. You could also look at perplexity, recall, or F1 score.
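For instance, a minimal sketch of character-length normalization (the `acc_norm` flavor used in several harnesses):

```python
def pick_choice(choices: list[str], logprobs: list[float]) -> int:
    # logprobs[i] = summed log-probability of choices[i] given the context;
    # dividing by character length avoids favoring short answers
    normed = [lp / len(choice) for lp, choice in zip(logprobs, choices)]
    return max(range(len(choices)), key=lambda i: normed[i])
```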
If you're looking at **generative** evaluations, this is where it gets trickyyy, so the next chapter is specifically on this!
## Evaluation's main challenge: Scoring free form text
Scoring free-form text is tricky because there are typically many different ways to express the same correct answer, making it hard to determine semantic equivalence through simple string matching, and output variations can make two semantically identical answers look completely different. Responses can be partially correct or contain a mix of accurate and inaccurate information. There can even be no single ground truth for the problem at hand, for example for tasks requiring to judge coherence, helpfulness, and style, which are inherently subjective and context-dependent.
### Automatically
When there is a ground truth, however, you can use automatic metrics. Let's see how.
#### Metrics
Most ways to automatically compare a string of text to a reference are match based.
<Sidenote>This is more interesting to do on data that was not included in the model training set, because you want to test if it **generalizes** well. You don't want a model which can only predict text it has already "seen", that would not be very useful! </Sidenote>
The easiest but least flexible match-based metrics are **exact matches** of token sequences. While simple and unambiguous, they provide no partial credit - a prediction that's correct except for one word scores the same as one that's completely wrong. <Sidenote> Be aware that "exact match" is used as a catch-all name, and also includes "fuzzy matches" of strings: comparisons after normalization, on subsets of tokens (prefix only, for example), etc. </Sidenote>
The translation and summarisation fields have introduced automatic metrics which compare similarity through overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with humans at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap. A simpler version of these is the **TER** (translation error rate), number of edits required to go from a prediction to the correct reference (similar to an edit distance).
Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).
I'm introducing here the most well known metrics, but all of these metrics have variations and extensions, among which CorpusBLEU, GLEU, MAUVE, METEOR, to cite a few.
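If you want to try some of these, the `sacrebleu` library exposes most of them; here's a small sketch (the sentences are made up):

```python
import sacrebleu

predictions = ["The cat sat on the mat.", "It is raining heavily."]
references = [["The cat is on the mat.", "Heavy rain is falling."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(predictions, references)
ter = sacrebleu.corpus_ter(predictions, references)
print(f"BLEU: {bleu.score:.1f}, TER: {ter.score:.1f}")
```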
<HtmlEmbed src="d3-text-metrics.html" frameless />
Once you have an accuracy score per sample, you can **aggregate** it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
If your score is **binary**, look at the **precision** (critical when false positives are costly), **recall** (critical when missing positives is costly), **F1 score** (balances precision and recall, good for imbalanced data), or **MCC** (Matthews Correlation Coefficient, which works well with imbalanced datasets by considering all confusion matrix elements).
If your score is **continuous** (less likely though), you can use **mean squared error** (penalizes large errors but heavily weights outliers) or **mean absolute error** (more balanced than MSE). <Sidenote> If you assume your data should follow a specific linear regression model (for example if you are studying model calibration), you can look at measures like the **R²** or correlation coefficients like **Pearson** (for linear relationships, assumes normality) or **Spearman** (for monotonic relationships without normality assumptions). However, it's a bit out of scope here. </Sidenote>
More generally, when picking your metric and its aggregation, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc).
<Note title="To go further">
- This [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/) covers some of the challenges of evaluating LLMs.
- If you're looking for metrics, you'll also find a good list with description, score ranges and use cases in [this organisation](https://huggingface.co/evaluate-metric).
</Note>
<Note title="Pros and cons of using automated metrics">
Automated benchmarks have the following advantages:
- **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (barring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
- **Scale at limited cost**: They are one of the cheapest ways to evaluate models at the moment.
- **Understandability**: Most automated metrics are very understandable.
However, they also have a **reduced use on more complex tasks**: an automatic metric requires a perfect, unique, and unambiguous reference/gold, which you only get for tasks where performance is easy to define and assess (for example, classification of toxicity, or knowledge questions with a single answer). More complex capabilities, on the other hand, are harder to decompose into a single, simple answer.
</Note>
#### Normalization
Normalization means changing a string of characters to make it fit a specific reference format. For example, when comparing a model prediction to a reference, you usually don't want to penalize extra spacing, added punctuation, or capitalization in the prediction. That's why you normalize your prediction.
Normalizations are vital for specific tasks, such as math evaluations, where you want to extract an equation from a longer prediction and compare it to a reference.
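A typical normalization pipeline (SQuAD-style, as a sketch) looks like this:

```python
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())                # collapse whitespace

# "The answer is: 42!" and "the answer is 42" now match exactly
assert normalize("The answer is: 42!") == normalize("the answer is 42")
```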
In the below table, we make a list of some issues we saw happening when extracting predictions from model outputs using SymPy naively for the MATH dataset, and how Math-Verify, a specific math parser, solved these.
| 📄 Example | ❗️Issue | ✅ Math-Verify | 🛑 Naive Approach |
| --- | --- | --- | --- |
| Therefore, the perimeter of one of these triangles is $14 + 7\sqrt{2}$ inches, expressed in simplest radical form. | Failed extraction | `7*sqrt(2) + 14` | None |
| Therefore, the sum of the infinite geometric series is \(\frac{7}{9}\). | Failed extraction | `7/9` | None |
| The final answer is $2x + 4y + z - 19 = 0$. I hope it is correct. | Partial parse of parametric eq | `Eq(2*x + 4*y + z - 19, 0)` | `0` |
| \(23\) | Failed extraction due to latex borders | `23` | None |
| \((- \infty, -14) \cup (-3, \infty)\). | Failed extraction due to interval | `Union(Interval.open(-oo, -14), Interval.open(-3, oo))` | None |
| 100\% | Failed extraction due to invalid symbol | `1` | None |
| 1/3 == 0.333333 | No rounding support | True | False |
| sqrt(1/2)*7 == sqrt(0.5)*7 | No numerical evaluation support | True | False |
<Sidenote>
Look at [this blog](https://huggingface.co/blog/math_verify_leaderboard) for more details!
</Sidenote>
Normalizations can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still help provide signal at the task level.
They are also important for evaluating predictions generated with chain of thought, or reasoning, as you'll need to remove the reasoning trace (which is not part of the final answer) from the output to get the actual answer.
#### Sampling
When models generate outputs, sampling multiple times and aggregating results can provide a more robust signal than a single greedy generation.
This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
Common sampling-based metrics are:
- **pass@k over n**: Given n generated samples, measures whether at least k pass the test (a numerically stable sketch of the unbiased estimator is shown after this list). <Sidenote> You'll find two definitions of pass@k: the trivial one, $\text{pass@}k = \mathbb{1}[c \geq k]$, or the unbiased estimator $\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$, where $c$ is the number of correct samples among $n$ total samples. </Sidenote>
- **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
- **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or a pass@k (sample n reasoning chains, extract final answers, take majority or a threshold).
- **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using "best" or "most common" case.
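For reference, here is the standard numerically stable implementation of the unbiased pass@k estimator from the sidenote above (popularized by the HumanEval codebase):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k, given c correct samples out of n."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Equivalent to 1 - comb(n - c, k) / comb(n, k), without overflow
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=20, c=3, k=5))  # ≈ 0.60
```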
<HtmlEmbed src="d3-sampling-metrics.html" title="Sampling metrics comparison" />
When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
<Note title="When can you use sampling and when shouldn't you?">
- **For training evaluation/ablations**: ❌ Generally avoid sampling metrics as they're expensive and add variance. Stick to greedy decoding with a fixed seed.
- **For post-training evaluation**: ✅ Sampling metrics can reveal capabilities that greedy decoding misses (especially for more complex tasks requiring reasoning, math or code).
- **At inference**: ✅ These metrics help estimate how much improvement you can get from sampling multiple times at inference. It's particularly cool when you want to study how far you can push small models with test time compute.
However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
</Note>
#### Functional scorers
Instead of comparing generated text to a reference through fuzzy string matching, functional testing evaluates whether outputs satisfy specific verifiable constraints. This approach is extremely promising because it's more flexible and allows "infinite" updates of the test case through rule-based generation (which reduces overfitting).
**IFEval and IFBench** are excellent examples of this approach for instruction following evaluation. Rather than asking "does this text match a reference answer?", they ask "does this text satisfy formatting constraints given in the instructions?"
For instance, instructions might specify:
- *"Include exactly 3 bullet points"* → verify the output contains exactly 3 bullets
- *"Capitalize only the first sentence"* → parse and check capitalization patterns
- *"Use the word 'algorithm' at least twice"* → count word occurrences
- *"Your response must be in JSON format with keys 'answer' and 'reasoning'"* → validate JSON structure
Each constraint can be checked with a specific rule-based verifier, making these evaluations unambiguous, interpretable, fast, and considerably less costly than using models as judges.
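For illustration, hypothetical verifiers for the constraints above might look like this (IFEval's actual checkers are more thorough):

```python
import json

# Made-up helpers, not IFEval's real implementation.
def check_exact_bullets(text: str, n: int = 3) -> bool:
    # count lines that start with a bullet marker
    return sum(line.lstrip().startswith(("-", "*", "•")) for line in text.splitlines()) == n

def check_word_count(text: str, word: str, minimum: int = 2) -> bool:
    return text.lower().split().count(word.lower()) >= minimum

def check_json_keys(text: str, keys=("answer", "reasoning")) -> bool:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in keys)
```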
This functional approach works particularly well for instruction following, but requires creativity to extend to other text properties. The key is identifying aspects of text that can be verified programmatically rather than through semantic comparison.
<Sidenote>
Functional testing is inspired by code evaluation, where functional testing through unit tests is standard practice (checking if generated code produces correct outputs for given inputs).
</Sidenote>
### With humans
Human evaluation is simply asking humans to score predictions.
Human evaluation is very interesting, because of its **flexibility** (if you define clearly enough what you are evaluating, you can get scores for about anything!), **inherent un-contamination** (if humans write new questions to test your system, they should not be present in your training data, hopefully), and **good correlation with human preference** for obvious reasons.
<Sidenote>
However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalize.
</Sidenote>
Different approaches exist to evaluate models with humans in the loop.
**Vibe-checks** is the name given to manual evaluations done by individual members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on their use cases of preference. (I've also seen the term "canary-testing" used for this, in reference to the high-signal canary-in-a-coal-mine approach). Said use cases can be anything from the most exciting to the most mundane - to cite some I've seen on Reddit, they covered legal questions in German, coding, tool use, quality of erotica written, etc. Often shared on forums or social media, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
<HtmlEmbed src="d3-vibe-checks.html"/>
Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they decide one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce consistent grading from many community members using broad guidelines, especially since annotator preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (an effect observed by the statistician Galton, who noted that individual answers trying to estimate a numerical value, like the weight of an ox, could be modeled as a probability distribution centered around the actual answer).
The last approach is **systematic annotations**, where you provide extremely specific guidelines to paid selected annotators, in order to remove as much of the subjectivity bias as possible (this is the approach used by most data annotation companies). However, it can get extremely expensive fast, as you have to keep on doing evaluations in a continuous and non automatic manner for every new model you want to evaluate, and it can still fall prey to human bias (this [study](https://arxiv.org/abs/2205.00501) showed that people with different identities tend to rate model answer toxicity very differently).
Vibe-checks are a particularly [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need), as you'll be testing the model on what's relevant to you. Casual human evaluations are cheap and, since they leverage users' creativity in a mostly unbounded manner, let you discover fun and interesting edge cases. However, they can be prone to blind spots. <Sidenote> For example, there was a debate in the scientific community as to whether LLMs [can draw](https://arxiv.org/abs/2303.12712) unicorns [or not](https://twitter.com/DimitrisPapail/status/1719119242186871275). A year later, seems like most can! </Sidenote>
Once you want to scale to more systematic evaluation with paid annotators, you'll find that there are 3 main ways to do so. If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g. *Try to make both these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning. If **you already have a dataset** (e.g. a set of *prompts that you want your model to never answer*, for example for safety purposes), you prompt your model with them, and provide the prompt, output, and scoring guidelines to humans. If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*which can also be used as a scoring system in the above category*). It's a very important step when testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
Overall, however, human evaluation has a number of well known biases, based on first impressions, tone, alignment with annotator values, etc.; see the figure below.
<HtmlEmbed src="d3-human-biases.html"/>
These biases are not unexpected, but they must be taken into account: not all use cases should rely on using cheap human annotators - any task requiring factuality (such as code writing, evaluation of model knowledge, etc) should include another, more robust, type of evaluation to complete the benchmark (experts, automatic metrics if applicable, etc).
### With judge models
To mitigate the cost of human annotators, some people have looked into using models or derived artifacts (preferably aligned with human preferences) to evaluate models' outputs.
<Sidenote>This approach is not new, as you can find techniques to measure summarization quality from [model embeddings](https://arxiv.org/abs/1904.09675) in 2019.</Sidenote>
Judge models are simply **neural networks used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
Two approaches exist for grading: using [generalist, high capability models](https://arxiv.org/abs/2306.05685v4) or using [small specialist models](https://arxiv.org/pdf/2405.01535) trained specifically to discriminate from preference data (think "spam filter", but for toxicity for example). In the former case, when using an LLM as a judge, you give it a prompt to explain how to score models (ex: `Score the fluency from 0 to 5, 0 being completely un-understandable, ...`).
Models as judges allow you to score text on complex and nuanced properties.
For example, an exact match between a prediction and reference can allow you to test if a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators.
They are used on 3 main tasks:
- *Scoring a model generation*, on a provided scale, to assess a property of the text (fluency, toxicity, coherence, persuasiveness, etc).
- *Pairwise scoring*: comparing a pair of model outputs to pick the best text with respect to a given property
- *Computing the similarity* between a model output and a reference
<Sidenote> In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md)) </Sidenote>
#### Pros and cons of using judge-LLMs
People in favor of judge LLMs have been claiming they provide better:
- **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner (theoretically - in my opinion, they add more subtle bias than they are worth)
- **Scale and reproducibility**: They are more scalable than human annotators, which allows you to reproduce scoring on large amounts of data (if you control for temperature).
- **Cost**: They are cheap to instantiate, as they don't require training a new model, and can just rely on good prompting and an existing high-quality LLM. They are also cheaper than paying actual human annotators (capitalism...).
In my opinion, using LLM judges correctly is extremely tricky, and it's **easy to be deceived for critical use cases**:
- LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see below). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
- They are indeed scalable, but contribute to creating **massive amounts of data** which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
- They are indeed cheap to instantiate, but are not as good as paying actual expert human annotators for your specific use cases.
<HtmlEmbed src="d3-llm-biases.html"/>
This section is therefore a bit long, because you need to be well aware of the limitations of using models as judges: a lot of people blindly jump into using them because they seem easier than actually working with humans or designing new metrics, but then end up with uninterpretable data and tricky biases to extract.
My main personal gripe with using models as judges is that they introduce very subtle and un-interpretable bias in the answer selection. I feel that, much like excessive inbreeding in genetics ends up producing dysfunctional animals or plants, by using LLMs to select and train LLMs, we are just as likely to introduce minute changes that will have bigger repercussions a couple of generations down the line. I believe this type of bias is less likely to occur in smaller and more specialized models as judges (such as toxicity classifiers), but this remains to be rigorously tested and proven.
<Note title="Getting started with an LLM judge">
If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) on how to set up your first LLM as judge!
You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.
</Note>
#### Getting a Judge-Model
When using an existing LLM, you can go for [generalist, high capability models](https://arxiv.org/abs/2306.05685v4), [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or training your own.
**Using a generalist LLM**
With the introduction of more capable LLMs (such as ChatGPT), some researchers started exploring using big models as judges.
<Note title="Closed vs open source judge models" emoji="⚖️" variant="warning">
**Closed source model (Claude, GPT-4o) tradeoffs:**
Disadvantages:
- **Non-reproducible**: Models can change without notice via API updates
- **Black box**: Un-interpretable decision-making
- **Privacy risks**: Data sent to third parties, potential leakage
Advantages:
- Easy access without local setup or hardware requirements
**Open source models are closing the gap** while solving reproducibility and interpretability issues. Models like DeepSeek R1, gpt-oss, and the recent Qwen models are now competitive alternatives.
</Note>
You'll find a good cost analysis of model providers [here](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard) if you need help picking one.
**Using a tiny specialized LLM judge model**
You can also make the choice to use tiny specialized LLM judges. Often a couple billion parameters, they can run locally on most recent consumer hardware, and are trained from scratch or fine-tuned on instruction data. You often need to follow their specific prompt formats.
Some existing models as of 2024 were Flow-Judge-v0.1 ([weights](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)), 3.8B parameters, a Phi-3.5-mini-instruct fine-tuned on a synthetic preference dataset, Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on synthetic preference dataset, and JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models. Newer alternatives surely exist!
**Training your own**
You can also make the choice to train or fine-tune your own LLM-as-judge. (I would avoid doing this, unless you are on a very niche domain).
If you go in that direction, you'll first need to gather preference data for your task of interest, which can come
- From existing [human preference datasets](https://www.kaggle.com/competitions/lmsys-chatbot-arena)
- From model generated preference data (which you can generate following the above tiny-model judges papers data sections, or get directly, for example from the Prometheus [preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) and [feedback](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) collections).
Then you need to decide whether to start from a small model to train from scratch, or from an existing model that you can distill into a new smaller model, or quantize then fine-tune (using PEFT or adapter weights if the model is big and your training compute low) using the above data.
<Sidenote> Apparently [starting from a reward model works better than from an instruct model](https://x.com/dk21/status/1826292289930674590) </Sidenote>
#### Designing your evaluation prompt
Once you've selected your model, you need to define the best possible prompt for your task.
<Note title="Prompt design guidelines" emoji="📝" variant="info">
Provide a clear description of the task at hand:
- *Your task is to do X.*
- *You will be provided with Y.*
Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
- *You should evaluate property Z on a scale of 1 - 5, where 1 means ...*
- *You should evaluate if property Z is present in the sample Y. Property Z is present if ...*
Provide some additional "reasoning" evaluation steps:
- *To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...*
Specify the desired output format (adding fields will help consistency)
- *Your answer should be provided in JSON, with the following format \{"Score": Your score, "Reasoning": The reasoning which led you to this score\}*
</Note>
You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.py) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
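For reference, an illustrative judge prompt following the guidelines above could look like this (it's made up, not one of the linked templates):

```python
# An illustrative judge prompt, following the guidelines above.
JUDGE_PROMPT = """Your task is to evaluate the fluency of a model answer.
You will be provided with a question and the model's answer.

Evaluate fluency on a scale of 1 to 5, where 1 means completely
un-understandable and 5 means reads like careful native writing.

To judge this task, first read the answer carefully to identify grammatical
errors or awkward phrasings, then decide on your score.

Your answer should be provided in JSON, with the following format:
{"Score": <your score>, "Reasoning": <the reasoning which led you to this score>}

Question: <<question>>
Answer: <<answer>>"""

def fill(question: str, answer: str) -> str:
    # str.replace avoids fighting with the literal JSON braces in the template
    return JUDGE_PROMPT.replace("<<question>>", question).replace("<<answer>>", answer)
```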
<Note title="To remember when doing model as judge" emoji="⚠️" variant="warning">
Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
If you really want a score, use an integer scale and make sure you provide a detailed explanation of what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or use an additive prompt (*give 1 point for this characteristic of the answer, 1 additional point if ...*, etc.)
Using one prompt per capability to score tends to give better and more robust results.
</Note>
You can also improve accuracy using the following, possibly more costly, techniques:
- **Few shot examples**: like in many other tasks, providing examples can help the model's reasoning. However, this adds to your context length.
- **Reference**: you can also enhance your prompt with a reference if present, which increases accuracy
- **CoT**: [improves accuracy for older gen models](https://arxiv.org/abs/2212.08073), if you ask the model to output its chain of thought **before** the score (also observed [here](https://x.com/seungonekim/status/1749289437165769177))
- **Multiturn analysis**: can improve [factual error detection](https://arxiv.org/abs/2305.13281)
- Using **a jury** (many judges, where you pick an aggregate of the answers): [gives better results](https://arxiv.org/abs/2404.18796) than using a single model. It can be made considerably less costly by leveraging many smaller models instead of one big expensive model. You can also experiment with using one model with variations on temperature
- Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
If you are working on critical tasks (medical domain for example), make sure to use methodologies transferred from the humanities, and 1) compute inter-annotator agreement metrics to make sure your evaluators are as unbiased as possible, and 2) use proper survey design methodology when creating your scoring grid to mitigate bias. However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! It just depends on the consequences attached).
#### Evaluating your evaluator
Before using a judge-LLM in production or at scale, you want to evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
<Note>
This will be easier to do if it predicts binary outputs, because you'll be able to use interpretable classification metrics (accuracy/recall/precision). If it predicts scores on a scale, it will be much harder to estimate the quality of the correlation with a reference. Models are notoriously bad at predicting on a scale.
</Note>
So, once you have selected your model judge and its prompt, you'll need to do the following.
1. **Pick your baseline**
You'll need to compare your evaluator's judgments to a baseline: it can be human annotations, the output of another judge model that you know performs well on your task, a gold truth, the same judge with another prompt, etc.
<Note title="Quality over quantity for baseline" emoji="🎯" variant="info">
You don't need many baseline examples (50 can suffice), but they must be:
- **Representative**: Cover the full range of your task
- **Discriminative**: Include edge cases and challenging examples
- **High quality**: Use the best reference data you can obtain
</Note>
2. **Pick your metric**
Your metric will be used to compare your judge's evaluations with your reference.
In general, this comparison is considerably easier to do if your model is predicting binary classes or doing pairwise comparison, as you'll be able to compute accuracy (for pairwise comparison), or precision and recall (for binary classes), which are all very easy to interpret metrics.
Comparing the correlation of scores with human or model scoring will be harder to do. To understand why in more detail, I advise you to read this cool [blog section on the topic](https://eugeneyan.com/writing/llm-evaluators/#key-considerations-before-adopting-an-llm-evaluator).
In general, if you're a bit lost about what metrics to pick when (in terms of models, metrics, ...), you can also look at [this interesting graph](https://eugeneyan.com/assets/llm-eval-tree.jpg) from [the same above blog](https://eugeneyan.com/writing/llm-evaluators/) ⭐.
3. **Evaluate your evaluator**
For this step, you simply need to use your model and its prompt to evaluate your test samples! Then, once you get the evaluations, use your above metric and reference to compute a score for your evaluations.
You need to decide what your threshold for acceptance is. Depending on how hard your task is, you can aim for 80% to 95% accuracy, if you're doing pairwise comparison. Regarding correlations (if you're using scores), people in the literature tend to seem happy with 0.8 Pearson correlation with a reference. However, I've seen some papers declare that 0.3 indicates a good correlation with human annotators (^^") so ymmv.
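As a sketch, comparing a judge against a small human baseline can be as simple as this (all numbers are made up):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical judgments on a small baseline set
human = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # pairwise preferences from annotators
judge = np.array([1, 0, 1, 0, 0, 1, 0, 1])  # same pairs, judged by the model

agreement = (human == judge).mean()  # easy to interpret: 0.75 here

# If you used scores on a scale instead, you'd fall back to correlation
human_scores = np.array([5, 2, 4, 1, 3])
judge_scores = np.array([4, 2, 5, 1, 3])
r, _ = pearsonr(human_scores, judge_scores)  # aim for ~0.8 per the literature
```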
#### Tips and tricks
<Note title="Mitigating well known biases of LLM as judges" emoji="⚠️" variant="warning">
We discussed in this section's [intro](#pros-and-cons-of-using-judge-llms) a number of LLM judge biases. Let's see how you should try to mitigate them.
**Lack of internal consistency**:
➡️ You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
**Self-preference**:
➡️ You can mitigate this by using a jury
**Blindness to input perturbation**:
➡️ asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
➡️ or providing a coherent grading scale in the prompt.
**Position-bias**:
➡️ switching answer positions randomly
➡️ computing the log-probabilities of all possible choices to get a normalized answer
**Verbosity-bias** (or length-bias):
➡️ You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
**Format bias**:
➡️ You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
</Note>
**Picking correct tasks for an LLM judge**
LLM evaluators:
- are **bad at identifying hallucinations** in general, particularly what are called partial hallucinations (which look close to the ground truth but are actually slightly different) (see [this](https://arxiv.org/abs/2305.11747) and [this](https://arxiv.org/abs/2303.08896))
- have a low to OK-ish correlation with human annotators on [summarization](https://arxiv.org/abs/2304.02554) ([here too](https://arxiv.org/abs/2303.16634)), [faithfulness](https://arxiv.org/abs/2307.16877), and are not consistently correlated with human judgement more broadly against [a scope of tasks](https://arxiv.org/abs/2406.18403)
#### What about Reward Models?
Reward models learn to predict a score from human annotations for given prompt/completion pairs. The end goal is for them to make predictions aligned with human preference.
Once trained, these models can then be used to improve other models, by acting as a reward function which is a proxy for human judgment.
The most common type of reward model is the Bradley-Terry model, which outputs a single **pairwise score**, following:
$$p(\text{completion b is better than completion a}) = \text{sigmoid}(\text{score}_b - \text{score}_a)$$
This model is trained using only pairwise comparisons of completions, which are easier to collect than scores, but can only compare several completions for one prompt, and not completions across prompts.
Other models have expanded on this approach to predict a more nuanced probability that a completion is better than the other one ([example](https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B)).
This allows them to (theoretically) judge subtle differences between completions, at the cost of not being able to easily save and compare many different scores across prompts for the same test set. In addition, context length and memory limits can become an issue when comparing too long completions.
Some reward models such as [SteerLM](https://arxiv.org/abs/2311.09528) output **absolute scores**, which can be used to evaluate completions directly without the need for pairwise comparisons. These models can be easier to use for evaluation, but are also harder to collect data for, as absolute scores tend to be less stable than pairwise scores in human preferences.
More recently, models have been proposed that output both absolute and relative scores, such as [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257) and [ArmoRM](https://arxiv.org/abs/2406.12845).
<Note title="How do I use a Reward Model for Evaluation?">
Given a dataset of prompts, we can generate completions from a language model and ask a reward model to score them.
For models that give absolute scores, the resulting scores can be averaged to get a reasonable summary score.
However, in the more common case of relative scores, the average reward can be biased by outliers (a few very good or very bad completions) as different prompts may have inherently different reward scales (some prompts are way harder or easier than others).
<Sidenote>
For relative scores, don't just average raw rewards: outliers and varying prompt difficulty scales will bias results. Use win rates or win probabilities against a reference instead.
</Sidenote>
Instead, we can use:
- **win rates**: take a reference set of completions and calculate the percentage of the model's completions that are ranked higher than the reference completions. It is slightly more granular than a raw average.
- **win probabilities**: the mean probability of the completions being better than the reference completions, which can give a more fine-grained and smoothly changing signal.
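A sketch of both aggregations, given hypothetical scalar rewards for model and reference completions on the same prompts:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical Bradley-Terry-style rewards, one per prompt
model_rewards     = np.array([1.2, -0.3, 0.8, 2.1])
reference_rewards = np.array([0.9,  0.4, -0.2, 2.5])

win_rate = (model_rewards > reference_rewards).mean()         # 0.5 here
win_prob = sigmoid(model_rewards - reference_rewards).mean()  # smoother signal
```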
</Note>
<Note title="Pros and Cons of Reward Models">
Reward models are typically:
- **Very fast**: Getting a score is as simple as running a forward pass of a relatively small model once (since we only get a score, and not long text, contrary to judge-LLMs)
- **Deterministic**: The same scores will be reproduced through the same forward pass
- **Unlikely to suffer from positional bias**: As most models take only one completion, they cannot be influenced by the order. For pairwise models, positional bias is often also minimal, as long as the training data was balanced with respect to which of the first or second answer was the best.
- **Require no prompt engineering**: the model simply outputs a score from one or two completions, depending on the preference data it's been trained on.
On the other hand they:
- **Require specific fine-tuning**: This can be a relatively costly step, and although they inherit many capabilities from a base model, they may still perform poorly on tasks that are out of the training distribution.
- **Lose efficiency when used both in reinforcement learning and evaluation** (or when using direct alignment algorithms on datasets that are similar to the training data of the reward model), as the language model may overfit to the reward model's preferences.
</Note>
<Note title="Going further">
- A good place to find high performing models is the [RewardBench Leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
- You can look at how reward models have been used in the [Nemotron](https://arxiv.org/abs/2406.11704) paper.
- For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
- Tracking of win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
</Note>
### Constraining model outputs
In a number of cases, we might want the model to output a prediction which follows a very specific format to simplify evaluation.
#### Using a prompt
The easiest way to do this is to add a task prompt which contains very specific instructions as to how the model should answer (`Provide numerical answers in digits.`,`Use no abbreviation.`, etc).
It won't necessarily work all the time but should be good enough for high capability models. That's the approach we followed in the [GAIA](https://huggingface.co/papers/2311.12983) paper for example.
#### Few shots and in context learning
The next way to do so is to constrain the model through what is called "in context learning". By providing examples in the prompt (what is called `few-shot prompting`), the model is implicitly biased towards following the repeated prompt shape for the actual sample.
<Note>
It's a method which overall worked quite well until the end of 2023!
However, the widespread adoption of instruction-tuning methods and the addition of instruction data in later stages of model pre-training (continuous pre-training) has biased more recent models towards specific output formats (what is called [here](https://arxiv.org/abs/2407.07890) *training on the test task*, and what I would call *overfitting the prompt format*). Reasoning models also don't play that well with few-shot examples because of the reasoning trace.
It's also a method which can be limited for older models with smaller context sizes, as some few-shot examples may not fit into the context window.
</Note>
#### Structured text generation
Structured text generation constrains the outputs to follow a given path, defined by a grammar or by regular expressions, for example. The `outlines` library implements this using finite state machines, which is very neat. (Other approaches exist, such as using interleaved generation for json generation, but the FSM one is my favorite).
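As a taste, here's a minimal sketch assuming the pre-1.0 `outlines` API (`outlines.models` / `outlines.generate`); the interface has evolved across versions, so check their docs for the current one:

```python
# Sketch of regex-constrained generation with the pre-1.0 `outlines` API;
# the model name is an arbitrary example.
import outlines

model = outlines.models.transformers("HuggingFaceTB/SmolLM2-135M-Instruct")

# Constrain the model to answer with a single choice letter
generator = outlines.generate.regex(model, r"[ABCD]")
answer = generator("Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer: ")
```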
To understand more about what happens when using structured generation, you can check the [blog](https://huggingface.co/blog/evaluation-structured-outputs) we wrote together: structured generation reduces prompt variance in evaluation, and makes results and rankings more stable. You can also check the overall `outlines` [blog](https://blog.dottxt.co/) for interesting implementations and observations linked to structured generation.
However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show that structured generation can lower model performance on some tasks (like reasoning), by moving the prior too far away from the expected probability distribution.
<Note title="Going further" emoji="📚" variant="warning">
- ⭐ [Understanding how finite state machines work for structured generation](https://blog.dottxt.co/coalescence.html), by Outlines. A super clear guide on how their method works!
- [The outlines method paper](https://arxiv.org/abs/2307.09702), a more academic explanation of the above
- [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
</Note>
## The forgotten children of evaluation
### Statistical validity
When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
These confidence intervals can be obtained from the raw scores via standard deviations or [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) - for automatic metrics, this is relatively trivial; for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human-based evaluations, you should report inter-annotator agreement.
You can also compute these with prompt variations, by asking the same questions in slightly different ways, or re-running on the same samples with different prompt formats.
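A bootstrap over per-sample scores takes a few lines; in this sketch the scores are randomly generated stand-ins for your real per-sample results:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=500)  # stand-in for your per-sample 0/1 scores

# Resample with replacement many times to estimate the spread of the mean
boot_means = [rng.choice(scores, size=scores.size, replace=True).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"accuracy = {scores.mean():.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```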
### Cost and efficiency
When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 1 (because it decides to make an entire segue on binary vs decimal arithmetic) is considerably less efficient than a smol model answering 11 in a handful of tokens.
<div className="card" style="height: fit-content; max-width: 75%; margin: 40px auto;">
<img src={envImage.src} alt="Environmental impact metrics for model evaluation" style="height: auto !important; object-fit: contain !important; display: block; margin: 0 auto;" />
</div>
We suggest you report the following:
- **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important to estimate **efficiency**, and it will affect the cost of model as judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
<Sidenote> These cost metrics can also be critical when comparing evaluation methods. For instance, while using a powerful LLM as a judge might provide better signal than automatic metrics, the 100x cost increase may not be justified for all use cases. Similarly, sampling-based metrics (pass@k, maj@n) multiply costs with the number of samples, which should be weighed against the improved signal they provide.</Sidenote>
- **Time**: Document the inference time required by the model to complete the evaluation. This includes both the actual inference time and any overhead from API rate limits. This is particularly important for any time-sensitive applications (like some agentic tool use, as in GAIA2).
Last but not least, reporting the environmental footprint of the models you are running is becoming increasingly important given the overall state of resources available on Earth. This includes carbon emissions from training and energy consumption at inference, which will depend on the model size, hardware (if you know it), and the tokens generated. Some smaller or quantized models reach a very interesting performance-to-consumption ratio.