|
|
--- |
|
|
title: rbeval |
|
|
emoji: 💩 |
|
|
colorFrom: yellow |
|
|
colorTo: blue |
|
|
sdk: streamlit |
|
|
sdk_version: 1.37.0 |
|
|
app_file: app.py |
|
|
pinned: false |
|
|
--- |
|
|
|
|
|
# RoBustEval Dashboard |
|
|
|
|
|
This dashboard is best viewed at [the huggingface space](https://huggingface.co/spaces/mli-will/rbeval) |
|
|
|
|
|
### Introduction |
|
|
|
|
|
LLM MCQA (multiple choice question-answering) benchmarks are measured in the following way: |
|
|
1. Some number of few shot examples are pulled from the validation set of the MCQA benchmark and formatted as |
|
|
> **Question**: What is the capital of France? \ |
|
|
> (A) Paris \ |
|
|
> (B) London \ |
|
|
> (C) Berlin \ |
|
|
> (D) Madrid \ |
|
|
> **Answer**: A |
|
|
2. The target question is then appended, without the answer, and fed into the model as |
|
|
> **Question**: What is the capital of France? \ |
|
|
> (A) Paris \ |
|
|
> (B) London \ |
|
|
> (C) Berlin \ |
|
|
> (D) Madrid \ |
|
|
> **Answer**: |
|
|
3. The model then outputs it's predictions for the token that should come directly after **Answer**: |
|
|
4. The probabilities $p_i, i \in \{A,B,C,D\}$ for the tokens resulting from tokenizing the strings `"A", "B", "C", "D"` are then recorded |
|
|
5. A question with correct answer $k \in \{A,B,C,D\}$ is marked is correct if $p_k > p_i, i \in \{A,B,C,D\} \setminus \{k\}$ |
|
|
6. The accuracy is reported as the percentage of questions correctly answered |
|
|
|
|
|
This method for evaluation is reasonable, but leaves behind significant amounts of information about model inference. |
|
|
For example, consider a question with correct answer $C$. *The following model outputs are all scored the same*: |
|
|
| Probability | $p_A$ | $p_B$ | $\mathbf{p_C}$ | $p_D$ | |
|
|
|-------------|-----|-----|-----|-----| |
|
|
| Model 1 | 0.00 | 0.00 | **1.00** | 0.00 | |
|
|
| Model 2 | 0.20 | 0.10 | **0.60** | 0.01 | |
|
|
| Model 3 | 0.25 | 0.25 | **0.26** | 0.24 | |
|
|
| Model 4 | 0.0 | 0.0 | **0.01** | 0.0 | |
|
|
|
|
|
In this case, Model 1 is a clear best, with full confidence in the correct answer. |
|
|
Model 2 is also good, but not as good as Model 1. |
|
|
Model 3 is _almost entirely guessing_, and Model 4 doesn't even understand the format of the question, but since the evaluation method only considers the probability of the ABCD tokens, it still gets marked as correct. |
|
|
|
|
|
All of these scenarios result in the exact same score, when maybe, they really shouldn't. |
|
|
|
|
|
**This dashboard is an attempt to address this issue by providing a more detailed analysis of model predictions on MCQA benchmarks.** |
|
|
|
|
|
We will be interested in the following quantities (symbols are chosen to make them easy to type into figures) |
|
|
1. $\Phi = p_{\text{correct}}$, the probability of the correct answer |
|
|
2. $\Delta = p_{\text{correct}} - \max(\{p_i : i \in I\})$ where $I$ is the set of incorrect answers |
|
|
|
|
|
We will then plot the distribution of these quantities for a given model, and compare them across models. |
|
|
|
|
|
Here, $\Delta$ is a measure of how much more confident the model is in the correct answer compared to the most confident incorrect answer, while $p_{\text{correct}}$ is a measure of how confident the model is in the correct answer. |
|
|
|
|
|
An ideal model would have $\Phi = 1$ (and therefore $\Delta=1$) always, while a model that performs random guessing would have $p_i = \Phi = 0.25$ (and therefore $\Delta=0$) always. |
|
|
|
|
|
### Strength as a classifier |
|
|
If we take each question given, we can record two results from it as a correct/incorrect classifier: |
|
|
* Record `y_true=1` with `y_pred=p_{\text{correct}}` |
|
|
* Record `y_true=0` with `y_pred=\max(\{p_i : i \in I\})` |
|
|
We can then measure BCE and ROC-AUC for this classifier, and compare these metrics across models. |
|
|
|
|
|
<!--- |
|
|
### Reading $\Phi$ plots |
|
|
Let's look at an example: MMLU on Llama-7b and Guanaco-7b, an early example of instruction tuning, in the 5-shot setting. |
|
|
|
|
|
 |
|
|
|
|
|
The <span style="color:lightblue">**blue line is Llama-7b**</span> and the <span style="color:tomato">**red line is Guanaco-7b**</span>. There are a few things of note. |
|
|
* The y-axis denotes the percentage of questions which have a $\Phi$ value greater than the corresponding x-axis value. For example, looking at $\Phi=0.25$, we can see ~60% of Llama-7b's questions give correct answers a probability greater than 0.25, corresponding to random guessing. |
|
|
* Both models intersect at roughly $\Phi \approx 0.25$ with ~60% of samples, which means they both perform random guessing or worse for roughly 40% of the questions. |
|
|
* Guanaco-7b has a greater number of samples with low $\Phi$ values, indicating that it's more confident of it's incorrectness. |
|
|
* Similarly, Guanaco-7b has a greater number of samples with *large* $\Phi$ values, indicating that it's more confident of it's correctness as well. |
|
|
|
|
|
### Reading $\Delta$ plots |
|
|
|
|
|
 |
|
|
|
|
|
Again, the <span style="color:lightblue">**blue line is Llama-7b**</span> and the <span style="color:tomato">**red line is Guanaco-7b**</span>. From this plot, we can see a few things: |
|
|
* The 'accuracy' as we defined earlier is the percentage of samples with $\Delta > 0$. We can see this as the intersection of the curves with the vertical line at $\Delta = 0$. We can see that while instruction tuning doesn't seem to have changed the accuracy significantly, it has *vastly* altered the distribution of $\Delta$ values. |
|
|
* Guanaco-7b has a higher percentage of samples with large $\Delta$ values than Llama-7b. For example, in ~12-13% of the samples, Guanaco-7b predicts the correct answer with a probability at least 0.2 greater than the most confident incorrect answer. |
|
|
* Guanaco-7b also has a higher percentage of samples with very low $\Delta$ values. For example, we can read that ~75% of the samples have $\Delta > -0.2$, meaning that ~25% have $\Delta \leq -0.2$. This means that Guanaco-7b predicts the wrong answer with a probability at least 0.2 greater than the correct answer in 25% of the samples, when compared to Llama-7b which only does that in ~6-7% of the samples. |
|
|
--> |
|
|
|
|
|
## How to use this notebook |
|
|
|
|
|
Now that you know what $\Phi$ and $\Delta$ plots are, below I've provided a simple interface to inspect plots for a wide variety of common models. |
|
|
Currently, Llama 1,2,3 7/8B and variations of these models are available to compare. |
|
|
I will be adding more models soon. |
|
|
|
|
|
**Note**: All figures are *fully interactive*. Click/shift-click on the legends to select individual/multiple lines, zoom in and out, and hover over lines to see exact values at each point. |
|
|
**Note**: Some models show strange behaviour in the $\Delta$ plots around $\Delta=0$. This appears to only be in instruction tuned models, and I'm currently investigating the cause. It could be weird fp16 behaviour, but I'm not sure yet. |
|
|
|
|
|
**TODO**: Clean up and explain the model comparison tool below the performance plots. |
|
|
|