| # Using the `evaluator` | |
| The `Evaluator` classes let you evaluate a triplet of model, dataset, and metric. The model is wrapped in a pipeline that handles all preprocessing and post-processing. Out of the box, `Evaluator`s support `transformers` pipelines for the supported tasks, but custom pipelines can also be passed, as showcased in the section [Using the `evaluator` with custom pipelines](custom_evaluator). | |
| Currently supported tasks are: | |
| - `"text-classification"`: will use the [`TextClassificationEvaluator`]. | |
| - `"token-classification"`: will use the [`TokenClassificationEvaluator`]. | |
| - `"question-answering"`: will use the [`QuestionAnsweringEvaluator`]. | |
| - `"image-classification"`: will use the [`ImageClassificationEvaluator`]. | |
| - `"text-generation"`: will use the [`TextGenerationEvaluator`]. | |
| - `"text2text-generation"`: will use the [`Text2TextGenerationEvaluator`]. | |
| - `"summarization"`: will use the [`SummarizationEvaluator`]. | |
| - `"translation"`: will use the [`TranslationEvaluator`]. | |
| - `"automatic-speech-recognition"`: will use the [`AutomaticSpeechRecognitionEvaluator`]. | |
| - `"audio-classification"`: will use the [`AudioClassificationEvaluator`]. | |
| To run an `Evaluator` with several tasks in a single call, use the [EvaluationSuite](evaluation_suite), which runs evaluations on a collection of `SubTask`s. | |
| Each task has its own set of requirements for the dataset format and pipeline output, so make sure to check them for your custom use case. Let's have a look at some of them and see how you can use the evaluator to evaluate one or several models, datasets, and metrics at the same time. | |
| ## Text classification | |
| The text classification evaluator can be used to evaluate text models on classification datasets such as IMDb. Besides the model, data, and metric inputs, it takes the following optional inputs: | |
| - `input_column="text"`: with this argument the column with the data for the pipeline can be specified. | |
| - `label_column="label"`: with this argument the column with the labels for the evaluation can be specified. | |
| - `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. E.g. the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels. | |
| By default the `"accuracy"` metric is computed. | |
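To make the role of `label_mapping` concrete, here is a minimal sketch of the effect it has on pipeline outputs before a metric is computed. It is an illustration only and not the library's implementation; the sample outputs and references are made up for the example.

```python
# Illustrative only: mimics the effect of `label_mapping`, not the library's code.
# Hypothetical pipeline outputs for four inputs.
pipeline_outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.67},
    {"label": "NEGATIVE", "score": 0.55},
]
# Integer labels as they would be stored in `label_column`.
references = [1, 0, 0, 0]
label_mapping = {"NEGATIVE": 0, "POSITIVE": 1}

# Translate label names into the integer ids used by the dataset.
predictions = [label_mapping[output["label"]] for output in pipeline_outputs]

# Accuracy is the fraction of predictions that match the references.
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(predictions, accuracy)
```

Without the mapping, the string labels would never equal the integer references, so the metric could not be computed meaningfully.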
| ### Evaluate models on the Hub | |
| There are several ways to pass a model to the evaluator: you can pass the name of a model on the Hub, you can load a `transformers` model and pass it to the evaluator or you can pass an initialized `transformers.Pipeline`. Alternatively you can pass any callable function that behaves like a `pipeline` call for the task in any framework. | |
| So any of the following works: | |
| ```py | |
| from datasets import load_dataset | |
| from evaluate import evaluator | |
| from transformers import AutoModelForSequenceClassification, pipeline | |
| data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000)) | |
| task_evaluator = evaluator("text-classification") | |
| # 1. Pass a model name or path | |
| eval_results = task_evaluator.compute( | |
|     model_or_pipeline="lvwerra/distilbert-imdb", | |
|     data=data, | |
|     label_mapping={"NEGATIVE": 0, "POSITIVE": 1} | |
| ) | |
| # 2. Pass an instantiated model | |
| model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb") | |
| eval_results = task_evaluator.compute( | |
|     model_or_pipeline=model, | |
|     data=data, | |
|     label_mapping={"NEGATIVE": 0, "POSITIVE": 1} | |
| ) | |
| # 3. Pass an instantiated pipeline | |
| pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb") | |
| eval_results = task_evaluator.compute( | |
|     model_or_pipeline=pipe, | |
|     data=data, | |
|     label_mapping={"NEGATIVE": 0, "POSITIVE": 1} | |
| ) | |
| print(eval_results) | |
| ``` | |
| <Tip> | |
| Without specifying a device, model inference defaults to the first GPU on the machine if one is available, and to the CPU otherwise. To use a specific device, pass `device` to `compute`: `-1` uses the CPU and a non-negative integer (starting with `0`) uses the associated CUDA device. | |
| </Tip> | |
| The results will look as follows: | |
| ```python | |
| { | |
|     'accuracy': 0.918, | |
|     'latency_in_seconds': 0.013, | |
|     'samples_per_second': 78.887, | |
|     'total_time_in_seconds': 12.676 | |
| } | |
| ``` | |
| Note that evaluation results include both the requested metric, and information about the time it took to obtain predictions through the pipeline. | |
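The three timing fields are closely related. The sketch below assumes they are derived from a single wall-clock measurement over all samples, which is consistent with the numbers shown above for 1000 samples. The helper name `time_performance` is hypothetical and not part of the library.

```python
def time_performance(total_time_in_seconds: float, num_samples: int) -> dict:
    """Derive throughput and per-sample latency from one wall-clock measurement.

    Hypothetical helper for illustration; assumes both values come from the
    same total time over all samples.
    """
    return {
        "total_time_in_seconds": total_time_in_seconds,
        "samples_per_second": num_samples / total_time_in_seconds,
        "latency_in_seconds": total_time_in_seconds / num_samples,
    }


# Numbers taken from the example results above (1000 IMDb samples).
perf = time_performance(total_time_in_seconds=12.676, num_samples=1000)
print({name: round(value, 3) for name, value in perf.items()})
```

Small differences from the reported values are expected because the figures shown above are rounded.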
| <Tip> | |
| The timing figures give a useful indication of inference speed but should be taken with a grain of salt: they include all the processing that goes on in the pipeline, such as tokenization and post-processing, which can differ from model to model. They also depend heavily on the hardware you run the evaluation on, and you may be able to improve them by tuning settings such as the batch size. | |
| </Tip> | |
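Coming back to the last way of passing a model mentioned earlier, a callable that behaves like a pipeline call, the following toy sketch shows roughly what such an object could look like for text classification. The class name, the keyword list, and the `task` attribute are assumptions made for illustration; check the custom pipelines section for the exact requirements.

```python
class KeywordSentimentPipeline:
    """Toy stand-in for a text-classification pipeline (hypothetical example)."""

    # Assumption: the evaluator may check that the pipeline's task matches its own.
    task = "text-classification"

    POSITIVE_WORDS = {"great", "good", "excellent", "loved"}

    def __call__(self, texts, **kwargs):
        # Accept a single string or an iterable of strings, like a pipeline call.
        if isinstance(texts, str):
            texts = [texts]
        outputs = []
        for text in texts:
            words = set(text.lower().split())
            label = "POSITIVE" if words & self.POSITIVE_WORDS else "NEGATIVE"
            # Mirror the pipeline output format: one dict per input.
            outputs.append({"label": label, "score": 1.0})
        return outputs


toy_pipe = KeywordSentimentPipeline()
toy_outputs = toy_pipe(["A great film", "Dull and far too long"])
print(toy_outputs)
```

Under these assumptions, such an object could be passed as `model_or_pipeline=toy_pipe` together with the same `label_mapping` as in the examples above.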
| ### Evaluate multiple metrics | |
| With the [`combine`] function one can bundle several metrics into an object that behaves like a single metric. We can use this to evaluate several metrics at once with the evaluator: | |
| ```python | |
| import evaluate | |
| eval_results = task_evaluator.compute( | |
|     model_or_pipeline="lvwerra/distilbert-imdb", | |
|     data=data, | |
|     metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]), | |
|     label_mapping={"NEGATIVE": 0, "POSITIVE": 1} | |
| ) | |
| print(eval_results) | |
| ``` | |
| The results will look as follows: | |
| ```python | |
| { | |
|     'accuracy': 0.918, | |
|     'f1': 0.916, | |
|     'precision': 0.9147, | |
|     'recall': 0.9187, | |
|     'latency_in_seconds': 0.013, | |
|     'samples_per_second': 78.887, | |
|     'total_time_in_seconds': 12.676 | |
| } | |
| ``` | |
| Next let's have a look at token classification. | |
| ## Token Classification | |
| With the token classification evaluator one can evaluate models for tasks such as NER or POS tagging. It has the following specific arguments: | |
| - `input_column="tokens"`: with this argument the column with the data for the pipeline can be specified. | |
| - `label_column="ner_tags"`: with this argument the column with the labels for the evaluation can be specified. | |
| - `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. E.g. the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels. | |
| - `join_by=" "`: While most datasets are already tokenized the pipeline expects a string. Thus the tokens need to be joined before passing to the pipeline. By default they are joined with a whitespace. | |
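The effect of `join_by` can be sketched in a few lines. The example below joins a pre-tokenized sentence into the string a pipeline would receive and records where each original token ends up, which is the kind of bookkeeping needed to line predictions up with word-level labels again. This is an illustrative sketch, not the evaluator's actual alignment code.

```python
# A pre-tokenized sentence, as found in datasets such as CoNLL-2003.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
join_by = " "

# The pipeline expects a single string, so the tokens are joined first.
text = join_by.join(tokens)

# Record the character span of every token inside the joined string.
spans = []
start = 0
for token in tokens:
    spans.append((start, start + len(token)))
    start += len(token) + len(join_by)

print(text)
print(spans[:3])
```

For languages that do not separate words with whitespace, a different `join_by` value (for example an empty string) may be more appropriate.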
| Let's have a look at how we can use the evaluator to benchmark several models. | |
| ### Benchmarking several models | |
| Here is an example where several models are compared with the `evaluator` in only a few lines of code, abstracting away the preprocessing, inference, postprocessing, and metric computation: | |
| ```python | |
| import pandas as pd | |
| from datasets import load_dataset | |
| from evaluate import evaluator | |
| from transformers import pipeline | |
| models = [ | |
| "xlm-roberta-large-finetuned-conll03-english", | |
| "dbmdz/bert-large-cased-finetuned-conll03-english", | |
| "elastic/distilbert-base-uncased-finetuned-conll03-english", | |
| "dbmdz/electra-large-discriminator-finetuned-conll03-english", | |
| "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner", | |
| "philschmid/distilroberta-base-ner-conll2003", | |
| "Jorgeutd/albert-base-v2-finetuned-ner", | |
| ] | |
| data = load_dataset("conll2003", split="validation").shuffle().select(range(1000)) | |
| task_evaluator = evaluator("token-classification") | |
| results = [] | |
| for model in models: | |
|     results.append( | |
|         task_evaluator.compute( | |
|             model_or_pipeline=model, data=data, metric="seqeval" | |
|         ) | |
|     ) | |
| df = pd.DataFrame(results, index=models) | |
| df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]] | |
| ``` | |
| The result is a table that looks like this: | |
| | model | overall_f1 | overall_accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | | |
| |:-------------------------------------------------------------------|-------------:|-------------------:|------------------------:|---------------------:|---------------------:| | |
| | Jorgeutd/albert-base-v2-finetuned-ner | 0.941 | 0.989 | 4.515 | 221.468 | 0.005 | | |
| | dbmdz/bert-large-cased-finetuned-conll03-english | 0.962 | 0.881 | 11.648 | 85.850 | 0.012 | | |
| | dbmdz/electra-large-discriminator-finetuned-conll03-english | 0.965 | 0.881 | 11.456 | 87.292 | 0.011 | | |
| | elastic/distilbert-base-uncased-finetuned-conll03-english | 0.940 | 0.989 | 2.318 | 431.378 | 0.002 | | |
| | gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner | 0.947 | 0.991 | 2.376 | 420.873 | 0.002 | | |
| | philschmid/distilroberta-base-ner-conll2003 | 0.961 | 0.994 | 2.436 | 410.579 | 0.002 | | |
| | xlm-roberta-large-finetuned-conll03-english | 0.969 | 0.882 | 11.996 | 83.359 | 0.012 | | |
| ### Visualizing results | |
| You can feed the `results` list above into the `radar_plot()` function to visualize different aspects of the models' performance and choose the model that best fits your use case, depending on the metric(s) that matter to you: | |
| ```python | |
| import evaluate | |
| from evaluate.visualization import radar_plot | |
| plot = radar_plot(data=results, model_names=models, invert_range=["latency_in_seconds"]) | |
| plot.show() | |
| ``` | |
| <div class="flex justify-center"> | |
| <img src="https://huggingface.co/datasets/evaluate/media/resolve/main/viz.png" width="400"/> | |
| </div> | |
| Don't forget to specify `invert_range` for metrics for which smaller is better (as is the case for latency in seconds). | |
| If you want to save the plot locally, you can use the `plot.savefig()` function with the option `bbox_inches='tight'`, to make sure no part of the image gets cut off. | |
| ## Question Answering | |
| With the question-answering evaluator one can evaluate models for QA without needing to worry about the complicated pre- and post-processing that's required for these models. It has the following specific arguments: | |
| - `question_column="question"`: the name of the column containing the question in the dataset | |
| - `context_column="context"`: the name of the column containing the context | |
| - `id_column="id"`: the name of the column containing the identification field of the question and answer pair | |
| - `label_column="answers"`: the name of the column containing the answers | |
| - `squad_v2_format=None`: whether the dataset follows the format of squad_v2 dataset where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred. | |
| Let's have a look at how we can evaluate QA models and compute confidence intervals at the same time. | |
| ### Confidence intervals | |
| Every evaluator comes with the option to compute confidence intervals using [bootstrapping](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html). Simply pass `strategy="bootstrap"` and set the number of resamples with `n_resamples`. | |
| ```python | |
| from datasets import load_dataset | |
| from evaluate import evaluator | |
| task_evaluator = evaluator("question-answering") | |
| data = load_dataset("squad", split="validation[:1000]") | |
| eval_results = task_evaluator.compute( | |
|     model_or_pipeline="distilbert-base-uncased-distilled-squad", | |
|     data=data, | |
|     metric="squad", | |
|     strategy="bootstrap", | |
|     n_resamples=30 | |
| ) | |
| ``` | |
| Results include confidence intervals as well as error estimates as follows: | |
| ```python | |
| { | |
|     'exact_match': | |
|     { | |
|         'confidence_interval': (79.67, 84.54), | |
|         'score': 82.30, | |
|         'standard_error': 1.28 | |
|     }, | |
|     'f1': | |
|     { | |
|         'confidence_interval': (85.30, 88.88), | |
|         'score': 87.23, | |
|         'standard_error': 0.97 | |
|     }, | |
|     'latency_in_seconds': 0.0085, | |
|     'samples_per_second': 117.31, | |
|     'total_time_in_seconds': 8.52 | |
| } | |
| ``` | |
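To give an intuition for where the interval and the standard error come from, here is a minimal percentile bootstrap written with the standard library only. It is a simplified sketch: the evaluator relies on SciPy's bootstrap routine, whose default method differs from the plain percentile method used here, and the function name `bootstrap_mean` and the sample scores are hypothetical.

```python
import random
import statistics


def bootstrap_mean(scores, n_resamples=30, confidence_level=0.95, seed=0):
    """Percentile bootstrap for the mean of per-example scores (illustrative)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the scores with replacement and record the mean of each resample.
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Read the interval bounds off the sorted resampled means.
    alpha = (1.0 - confidence_level) / 2.0
    lower = resampled_means[int(alpha * (n_resamples - 1))]
    upper = resampled_means[int(round((1.0 - alpha) * (n_resamples - 1)))]
    return {
        "score": statistics.fmean(scores),
        "confidence_interval": (lower, upper),
        # Spread of the resampled means estimates the standard error.
        "standard_error": statistics.stdev(resampled_means),
    }


# Hypothetical per-example exact-match scores: 82 correct out of 100.
exact_match = [1] * 82 + [0] * 18
result = bootstrap_mean(exact_match, n_resamples=200)
print(result)
```

More resamples give a more stable interval at the cost of extra computation, which is why `n_resamples` is exposed as an argument.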
| ## Image classification | |
| With the image classification evaluator we can evaluate any image classifier. It uses the same keyword arguments as the text classifier: | |
| - `input_column="image"`: the name of the column containing the images as PIL ImageFile | |
| - `label_column="label"`: the name of the column containing the labels | |
| - `label_mapping=None`: maps the class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` | |
| Let's have a look at how we can evaluate image classification models on large datasets. | |
| ### Handling large datasets | |
| The evaluator can be used on large datasets! The example below shows how to use it on ImageNet-1k for image classification. Beware that this example requires downloading ~150 GB of data. | |
| ```python | |
| from datasets import load_dataset | |
| from evaluate import evaluator | |
| from transformers import pipeline | |
| data = load_dataset("imagenet-1k", split="validation", use_auth_token=True) | |
| pipe = pipeline( | |
|     task="image-classification", | |
|     model="facebook/deit-small-distilled-patch16-224" | |
| ) | |
| task_evaluator = evaluator("image-classification") | |
| eval_results = task_evaluator.compute( | |
|     model_or_pipeline=pipe, | |
|     data=data, | |
|     metric="accuracy", | |
|     label_mapping=pipe.model.config.label2id | |
| ) | |
| ``` | |
| Since we use `datasets` to store the data, we benefit from a technique called memory mapping. The dataset is never fully loaded into memory, which saves a lot of RAM. Running the above code uses only roughly 1.5 GB of RAM, while the validation split is more than 30 GB in size. | |
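The general idea behind memory mapping can be illustrated with Python's standard `mmap` module. The sketch below is not how `datasets` is implemented (it memory-maps Arrow files); the record layout and file are made up purely to show that a single record can be read without loading the whole file.

```python
import mmap
import os
import tempfile

# Write a file of fixed-size records to stand in for an on-disk dataset.
record_size = 16
num_records = 1000
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(num_records):
        f.write(f"record-{i:09d}".encode("ascii"))  # exactly 16 bytes each
    path = f.name

# Map the file into memory; only the pages that are accessed are actually read.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    i = 742
    record = mapped[i * record_size : (i + 1) * record_size].decode("ascii")

os.remove(path)
print(record)
```

The operating system decides which parts of the file are kept in RAM, so the memory footprint stays small even when the file is far larger than the available memory.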