# Using the `evaluator`

The `Evaluator` classes let you evaluate a triplet of model, dataset, and metric. The model is wrapped in a pipeline, which is responsible for handling all preprocessing and post-processing. Out of the box, `Evaluator`s support transformers pipelines for the supported tasks, but custom pipelines can be passed, as showcased in the section [Using the `evaluator` with custom pipelines](custom_evaluator).
|
|
|
|
|
Currently supported tasks are:

- `"text-classification"`: will use the [`TextClassificationEvaluator`].
- `"token-classification"`: will use the [`TokenClassificationEvaluator`].
- `"question-answering"`: will use the [`QuestionAnsweringEvaluator`].
- `"image-classification"`: will use the [`ImageClassificationEvaluator`].
- `"text-generation"`: will use the [`TextGenerationEvaluator`].
- `"text2text-generation"`: will use the [`Text2TextGenerationEvaluator`].
- `"summarization"`: will use the [`SummarizationEvaluator`].
- `"translation"`: will use the [`TranslationEvaluator`].
- `"automatic-speech-recognition"`: will use the [`AutomaticSpeechRecognitionEvaluator`].
- `"audio-classification"`: will use the [`AudioClassificationEvaluator`].
|
|
|
|
|
To run an `Evaluator` with several tasks in a single call, use the [EvaluationSuite](evaluation_suite), which runs evaluations on a collection of `SubTask`s. |
|
|
|
|
|
Each task has its own set of requirements for the dataset format and pipeline output, so make sure to check them for your custom use case. Let's look at some of these tasks in turn.
|
|
|
|
|
## Text classification |
|
|
|
|
|
The text classification evaluator can be used to evaluate text models on classification datasets such as IMDb. Besides the model, data, and metric inputs, it takes the following optional inputs:
|
|
|
|
|
- `input_column="text"`: with this argument the column with the data for the pipeline can be specified.
- `label_column="label"`: with this argument the column with the labels for the evaluation can be specified.
- `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. E.g. the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels.
|
|
|
|
|
By default the `"accuracy"` metric is computed. |
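
To illustrate what `label_mapping` does, here is a minimal sketch in plain Python. It does not call the evaluator; the pipeline outputs and reference labels are made-up values, and the accuracy is computed by hand.

```python
# Hypothetical pipeline outputs: one dict per input text, each with a label name.
pipeline_outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.67},
    {"label": "NEGATIVE", "score": 0.88},
]

# Hypothetical integer labels, as they would be stored in `label_column`.
references = [1, 0, 0, 0]

# The mapping translates label names from the pipeline into dataset labels.
label_mapping = {"NEGATIVE": 0, "POSITIVE": 1}

predictions = [label_mapping[output["label"]] for output in pipeline_outputs]

# Accuracy is the fraction of predictions that match the references.
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)

print(predictions)
print(accuracy)
```

Without the mapping, the string predictions could not be compared with the integer references.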
|
|
|
|
|
### Evaluate models on the Hub |
|
|
|
|
|
There are several ways to pass a model to the evaluator: you can pass the name of a model on the Hub, you can load a `transformers` model and pass it to the evaluator or you can pass an initialized `transformers.Pipeline`. Alternatively you can pass any callable function that behaves like a `pipeline` call for the task in any framework. |
|
|
|
|
|
So any of the following works: |
|
|
|
|
|
```py
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline

data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")

# 1. Pass a model name or path
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

# 2. Pass an instantiated model
model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

# 3. Pass an instantiated pipeline
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
```
|
|
<Tip>

Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and the CPU otherwise. If you want to use a specific device you can pass `device` to `compute`, where -1 will use the CPU and a non-negative integer (starting with 0) will use the associated CUDA device.

</Tip>
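
As a rough sketch of the "any callable that behaves like a `pipeline` call" option, the toy class below mimics a text-classification pipeline without any model. It assumes the evaluator expects a `task` attribute on the object and a list of dicts with a `label` key as output; the class name and the keyword rule are hypothetical and only for illustration.

```python
from typing import Dict, List


class KeywordPipeline:
    """Toy stand-in for a text-classification pipeline (hypothetical)."""

    def __init__(self):
        # Assumption: the evaluator compares this attribute with its own task.
        self.task = "text-classification"
        self.positive_words = {"great", "good", "excellent", "loved"}

    def __call__(self, texts: List[str], **kwargs) -> List[Dict[str, object]]:
        outputs = []
        for text in texts:
            words = set(text.lower().split())
            label = "POSITIVE" if words & self.positive_words else "NEGATIVE"
            # Same output shape as a pipeline call: one dict per input text.
            outputs.append({"label": label, "score": 1.0})
        return outputs


toy_pipe = KeywordPipeline()
outputs = toy_pipe(["I loved this movie", "A dull and boring plot"])
print(outputs)
```

An object like `toy_pipe` could then be passed as `model_or_pipeline`, together with a matching `label_mapping`.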
|
|
|
|
|
|
|
|
The results will look as follows: |
|
|
```python |
|
|
{ |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
} |
|
|
``` |
|
|
|
|
|
Note that evaluation results include both the requested metric, and information about the time it took to obtain predictions through the pipeline. |
|
|
|
|
|
<Tip>

The timing figures give a useful indication of model inference speed but should be taken with a grain of salt: they include all the processing that goes on in the pipeline, such as tokenization and post-processing, which may differ from model to model. They also depend heavily on the hardware you are running the evaluation on, and you may be able to improve them by tuning settings such as the batch size.

</Tip>
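
The relationship between the timing fields can be sketched as follows. This is an illustrative reimplementation under the assumption that throughput and latency are both derived from one wall-clock measurement around the whole prediction call; `timed_predict` and `toy_predict` are hypothetical helpers, not part of the library.

```python
import time


def timed_predict(predict_fn, samples):
    """Run `predict_fn` on `samples` and derive throughput and latency."""
    start = time.perf_counter()
    predictions = predict_fn(samples)
    total_time = time.perf_counter() - start

    n_samples = len(samples)
    perf = {
        "total_time_in_seconds": total_time,
        "samples_per_second": n_samples / total_time,
        "latency_in_seconds": total_time / n_samples,
    }
    return predictions, perf


def toy_predict(texts):
    # Stand-in for a pipeline call; the sleep simulates processing time.
    time.sleep(0.01)
    return [len(text) for text in texts]


predictions, perf = timed_predict(toy_predict, ["a", "bb", "ccc"])
print(perf)
```

Because everything inside the timed call is counted, slow tokenization or post-processing shows up in these numbers just like slow model inference.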
|
|
|
|
|
### Evaluate multiple metrics |
|
|
|
|
|
With the [`combine`] function one can bundle several metrics into an object that behaves like a single metric. We can use this to evaluate several metrics at once with the evaluator: |
|
|
|
|
|
```python
import evaluate

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
```
|
|
The results will look as follows: |
|
|
```python |
|
|
{ |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
} |
|
|
``` |
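
Conceptually, bundling metrics means computing each one on the same predictions and references and merging the resulting dictionaries. The sketch below shows that idea in plain Python for binary labels; it is a simplified illustration, not how `combine` is implemented, and all function names and data are hypothetical.

```python
def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


def precision_recall_f1(predictions, references):
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def combine_metrics(metric_fns):
    """Return one callable that merges the outputs of several metric functions."""
    def compute(predictions, references):
        results = {}
        for metric_fn in metric_fns:
            results.update(metric_fn(predictions, references))
        return results
    return compute


combined = combine_metrics([accuracy, precision_recall_f1])
results = combined(predictions=[1, 0, 1, 1, 0, 1], references=[1, 0, 0, 1, 1, 1])
print(results)
```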
|
|
|
|
|
Next, let's have a look at token classification.
|
|
|
|
|
## Token Classification |
|
|
|
|
|
With the token classification evaluator one can evaluate models for tasks such as NER or POS tagging. It has the following specific arguments: |
|
|
|
|
|
- `input_column="text"`: with this argument the column with the data for the pipeline can be specified.
- `label_column="label"`: with this argument the column with the labels for the evaluation can be specified.
- `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. E.g. the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels.
- `join_by=" "`: while most datasets are already tokenized, the pipeline expects a string. Thus the tokens need to be joined before being passed to the pipeline. By default they are joined with a whitespace.
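
The effect of `join_by` can be sketched in a few lines. The example joins a pre-tokenized sentence into a single string and records where each original word starts and ends, which is one way to relate character-level pipeline predictions back to words; the token list is made up and the offset bookkeeping is an illustrative assumption, not the library's code.

```python
tokens = ["EU", "rejects", "German", "call"]
join_by = " "

# The pipeline receives one string instead of a list of tokens.
text = join_by.join(tokens)

# Track the character span of each word inside the joined string.
offsets = []
position = 0
for token in tokens:
    offsets.append((position, position + len(token)))
    position += len(token) + len(join_by)

print(text)
print(offsets)
```

Slicing `text` with each span recovers the original tokens, so a prediction attached to a character range can be assigned to the word it covers.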
|
|
|
|
|
Let's have a look at how the evaluator can be used to benchmark several models.
|
|
|
|
|
### Benchmarking several models |
|
|
|
|
|
Here is an example where several models are compared with the `evaluator` in only a few lines of code, abstracting away the preprocessing, inference, postprocessing, and metric computation:
|
|
|
|
|
```python
import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

models = [
    "xlm-roberta-large-finetuned-conll03-english",
    "dbmdz/bert-large-cased-finetuned-conll03-english",
    "elastic/distilbert-base-uncased-finetuned-conll03-english",
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",
    "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
    "philschmid/distilroberta-base-ner-conll2003",
    "Jorgeutd/albert-base-v2-finetuned-ner",
]

data = load_dataset("conll2003", split="validation").shuffle().select(range(1000))
task_evaluator = evaluator("token-classification")

results = []
for model in models:
    results.append(
        task_evaluator.compute(
            model_or_pipeline=model, data=data, metric="seqeval"
        )
    )

df = pd.DataFrame(results, index=models)
df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]]
```
|
|
|
|
|
The result is a table that looks like this: |
|
|
|
|
|
| model                                                              |   overall_f1 |   overall_accuracy |   total_time_in_seconds |   samples_per_second |   latency_in_seconds |
|:-------------------------------------------------------------------|-------------:|-------------------:|------------------------:|---------------------:|---------------------:|
| Jorgeutd/albert-base-v2-finetuned-ner                              |        0.941 |              0.989 |                   4.515 |              221.468 |                0.005 |
| dbmdz/bert-large-cased-finetuned-conll03-english                   |        0.962 |              0.881 |                  11.648 |               85.850 |                0.012 |
| dbmdz/electra-large-discriminator-finetuned-conll03-english        |        0.965 |              0.881 |                  11.456 |               87.292 |                0.011 |
| elastic/distilbert-base-uncased-finetuned-conll03-english          |        0.940 |              0.989 |                   2.318 |              431.378 |                0.002 |
| gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner |        0.947 |              0.991 |                   2.376 |              420.873 |                0.002 |
| philschmid/distilroberta-base-ner-conll2003                        |        0.961 |              0.994 |                   2.436 |              410.579 |                0.002 |
| xlm-roberta-large-finetuned-conll03-english                        |        0.969 |              0.882 |                  11.996 |               83.359 |                0.012 |
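
Once the results are in tabular form, picking a model is a matter of filtering and sorting. The sketch below copies the F1 and latency columns from the table into plain Python and selects the best model overall and the best model under a latency budget; the 0.005 second budget is an arbitrary value chosen for illustration.

```python
# F1 and latency values copied from the table above.
results_table = [
    {"model": "Jorgeutd/albert-base-v2-finetuned-ner", "overall_f1": 0.941, "latency_in_seconds": 0.005},
    {"model": "dbmdz/bert-large-cased-finetuned-conll03-english", "overall_f1": 0.962, "latency_in_seconds": 0.012},
    {"model": "dbmdz/electra-large-discriminator-finetuned-conll03-english", "overall_f1": 0.965, "latency_in_seconds": 0.011},
    {"model": "elastic/distilbert-base-uncased-finetuned-conll03-english", "overall_f1": 0.940, "latency_in_seconds": 0.002},
    {"model": "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner", "overall_f1": 0.947, "latency_in_seconds": 0.002},
    {"model": "philschmid/distilroberta-base-ner-conll2003", "overall_f1": 0.961, "latency_in_seconds": 0.002},
    {"model": "xlm-roberta-large-finetuned-conll03-english", "overall_f1": 0.969, "latency_in_seconds": 0.012},
]

# Best model when only quality matters.
best_f1 = max(results_table, key=lambda row: row["overall_f1"])

# Best model among those that meet a (hypothetical) latency budget.
latency_budget = 0.005
fast_enough = [row for row in results_table if row["latency_in_seconds"] <= latency_budget]
best_fast = max(fast_enough, key=lambda row: row["overall_f1"])

print(best_f1["model"])
print(best_fast["model"])
```

On these numbers the largest model wins on F1 alone, while a distilled model gives up less than one point of F1 for a much lower latency.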
|
|
|
|
|
|
|
|
### Visualizing results |
|
|
|
|
|
You can feed the `results` list above into the `radar_plot()` function to visualize different aspects of the models' performance and choose the model that is the best fit, depending on the metric(s) that are relevant to your use case:
|
|
|
|
|
```python
import evaluate
from evaluate.visualization import radar_plot

plot = radar_plot(data=results, model_names=models, invert_range=["latency_in_seconds"])
plot.show()
```
|
|
|
|
|
<div class="flex justify-center"> |
|
|
<img src="https://huggingface.co/datasets/evaluate/media/resolve/main/viz.png" width="400"/> |
|
|
</div> |
|
|
|
|
|
|
|
|
Don't forget to specify `invert_range` for metrics where smaller is better, such as latency in seconds.
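
The `invert_range=["latency_in_seconds"]` argument in the call above matters because a lower latency is better, whereas a radar chart usually places better values further out. The sketch below shows one common way to handle this, min-max scaling with an optional flip. It is an illustration of the idea only, not the implementation used by `radar_plot`, and the `normalize` helper is hypothetical.

```python
def normalize(values, invert=False):
    """Scale values to [0, 1]; with invert=True the smallest value maps to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [1.0 for _ in values]
    scaled = [(value - low) / (high - low) for value in values]
    return [1.0 - s for s in scaled] if invert else scaled


latencies = [0.005, 0.012, 0.002]  # seconds, smaller is better
f1_scores = [0.941, 0.962, 0.961]  # larger is better

print(normalize(latencies, invert=True))
print(normalize(f1_scores))
```

After inversion the fastest model gets the largest value, so it is drawn on the outer edge of the plot like the best model on any other axis.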
|
|
|
|
|
If you want to save the plot locally, you can use the `plot.savefig()` function with the `bbox_inches` option.
|
|
|
|
|
|
|
|
## Question Answering |
|
|
|
|
|
With the question-answering evaluator one can evaluate models for QA without needing to worry about the complicated pre- and post-processing that these models require. It has the following specific arguments:
|
|
|
|
|
|
|
|
- `question_column="question"`: the name of the column containing the question in the dataset
- `context_column="context"`: the name of the column containing the context
- `id_column="id"`: the name of the column containing the identification field of the question and answer pair
- `label_column="answers"`: the name of the column containing the answers
- `squad_v2_format=None`: whether the dataset follows the format of the squad_v2 dataset, where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred.
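
For orientation, the sketch below shows what rows in the SQuAD layout look like with the default column names, and one plausible way the squad_v2 format could be detected: by checking whether any question has an empty answer list. The rows are made up, and `looks_like_squad_v2` is a hypothetical helper rather than the library's inference logic.

```python
context = "The Eiffel Tower is in Paris."

rows = [
    {
        "id": "q1",
        "question": "Where is the Eiffel Tower?",
        "context": context,
        # Answers hold parallel lists of answer texts and character offsets.
        "answers": {"text": ["Paris"], "answer_start": [context.index("Paris")]},
    },
    {
        "id": "q2",
        "question": "Who designed the Louvre pyramid?",
        "context": context,
        # In squad_v2-style data, unanswerable questions have empty lists.
        "answers": {"text": [], "answer_start": []},
    },
]


def looks_like_squad_v2(examples, label_column="answers"):
    """Guess the format: any example without an answer suggests squad_v2."""
    return any(len(example[label_column]["text"]) == 0 for example in examples)


print(looks_like_squad_v2(rows[:1]))
print(looks_like_squad_v2(rows))
```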
|
|
|
|
|
Let's have a look at how we can evaluate QA models and compute confidence intervals at the same time.
|
|
|
|
|
### Confidence intervals |
|
|
|
|
|
Every evaluator comes with the option to compute confidence intervals using [bootstrapping](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html). Simply pass `strategy="bootstrap"` and set the number of resamples with `n_resamples`.
|
|
|
|
|
```python
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("question-answering")

data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-distilled-squad",
    data=data,
    metric="squad",
    strategy="bootstrap",
    n_resamples=30
)
```
|
|
|
|
|
Results include confidence intervals as well as error estimates as follows: |
|
|
|
|
|
```python |
|
|
{ |
|
|
|
|
|
{ |
|
|
|
|
|
|
|
|
|
|
|
}, |
|
|
|
|
|
{ |
|
|
|
|
|
|
|
|
|
|
|
}, |
|
|
|
|
|
|
|
|
|
|
|
} |
|
|
``` |
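
To make the bootstrap strategy more concrete, here is a minimal percentile bootstrap in plain Python. It resamples per-example scores with replacement, recomputes the mean each time, and reads the interval off the sorted resampled means. This is a simplified sketch and not the SciPy routine linked above, which offers other interval methods; the scores, the `bootstrap_ci` helper, and the result keys are illustrative assumptions.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=30, confidence_level=0.95, seed=0):
    """Percentile bootstrap for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)

    # Resample with replacement and recompute the statistic each time.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )

    # Cut off the same share of resampled means at both ends.
    alpha = (1 - confidence_level) / 2
    low_index = int(alpha * (n_resamples - 1))
    high_index = n_resamples - 1 - low_index

    return {
        "score": statistics.fmean(scores),
        "confidence_interval": (means[low_index], means[high_index]),
        "standard_error": statistics.stdev(means),
    }


# Hypothetical per-example exact-match scores (1 = correct, 0 = wrong).
scores = [1] * 80 + [0] * 20
result = bootstrap_ci(scores, n_resamples=30)
print(result)
```

With only 30 resamples the interval is coarse; a larger `n_resamples` gives a smoother estimate at the cost of more computation.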
|
|
|
|
|
## Image classification |
|
|
|
|
|
With the image classification evaluator we can evaluate any image classifier. It uses the same keyword arguments as the text classifier:
|
|
|
|
|
- `input_column="image"`: the name of the column containing the images as PIL ImageFile
- `label_column="label"`: the name of the column containing the labels
- `label_mapping=None`: maps the class labels defined by the model in the pipeline to values consistent with those defined in the `label_column`
|
|
|
|
|
Let's have a look at how we can evaluate image classification models on large datasets.
|
|
|
|
|
### Handling large datasets |
|
|
|
|
|
The evaluator can be used on large datasets! The example below shows how to use it on ImageNet-1k for image classification. Beware that this example requires downloading ~150 GB.
|
|
|
|
|
```python
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

# ImageNet-1k is gated, so an authenticated Hub login is required.
data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)

pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)

task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    # Map the model's label names to the integer labels in the dataset.
    label_mapping=pipe.model.config.label2id
)
```
|
|
|
|
|
Since we are using `datasets` to store data, we make use of a technique called memory mapping. This means that the dataset is never fully loaded into memory, which saves a lot of RAM. Running the above code uses only roughly 1.5 GB of RAM, while the validation split is larger than 30 GB.
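
The general idea of memory mapping can be demonstrated with Python's standard `mmap` module. The sketch below writes a file of about 2 MB, maps it, and reads six bytes from the middle without reading the whole file into a Python object. It illustrates the technique only; `datasets` relies on its own on-disk format, and the file contents here are made up.

```python
import mmap
import os
import tempfile

# Create a temporary file with a small marker in the middle.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 1_000_000 + b"needle" + b"\x00" * 1_000_000)
    path = f.name

with open(path, "rb") as f:
    # Map the file into the address space; pages are loaded on access.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        chunk = mm[1_000_000:1_000_006]  # only this slice is copied

os.remove(path)

print(size)
print(chunk)
```

The operating system loads only the pages that are actually touched, which is why a dataset much larger than the available RAM can still be iterated over.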
|
|
|