Evaluator

The evaluator classes for automatic evaluation.

Evaluator classes[[evaluate.evaluator]]

The main entry point for using the evaluator:

evaluate.evaluator[[evaluate.evaluator]]

Source

Utility factory method to build an Evaluator. Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.

Examples:

>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")

Parameters:

task (str) : The task defining which evaluator will be returned. Currently accepted tasks are:
- "image-classification": will return an ImageClassificationEvaluator.
- "question-answering": will return a QuestionAnsweringEvaluator.
- "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
- "token-classification": will return a TokenClassificationEvaluator.

Returns:

[Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator)

An evaluator suitable for the task.
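In practice, the returned evaluator is used through its compute() method, documented below. A minimal sketch, assuming the rotten_tomatoes dataset, the accuracy metric and the distilbert-base-uncased-finetuned-sst-2-english checkpoint are available:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("rotten_tomatoes", split="test[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
>>>     data=data,
>>>     metric="accuracy",
>>>     label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
>>> )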

The base class for all evaluator classes:

evaluate.Evaluator[[evaluate.Evaluator]]

Source

The Evaluator class is the base class from which all evaluators inherit and implements the operations they share. Refer to this class for methods common to the different evaluators.

check_required_columns[[evaluate.Evaluator.check_required_columns]]

Source

Ensure the columns required for the evaluation are present in the dataset.

Example:

>>> from datasets import load_dataset
>>> from evaluate import evaluator
>>> data = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").check_required_columns(data, {"input_column": "text", "label_column": "label"})

Parameters:

data (str or Dataset) : Specifies the dataset we will run evaluation on.

columns_names (Dict[str, str]) : Dictionary of column names to check in the dataset. The keys are the arguments to the evaluate.EvaluationModule.compute() method, while the values are the column names to check.

compute_metric[[evaluate.Evaluator.compute_metric]]

Source

Compute and return metrics.

get_dataset_split[[evaluate.Evaluator.get_dataset_split]]

Source

Infers which split to use if None is given.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").get_dataset_split(data="rotten_tomatoes")
WARNING:evaluate.evaluator.base:Dataset split not defined! Automatically evaluating with split: TEST
'test'

Parameters:

data (str) : Name of dataset.

subset (str) : Name of config for datasets with multiple configurations (e.g. 'glue/cola').

split (str, defaults to None) : Split to use.

Returns:

split

str containing which split to use

load_data[[evaluate.Evaluator.load_data]]

Source

Load dataset with given subset and split.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").load_data(data="rotten_tomatoes", split="train")
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

Parameters:

data (Dataset or str, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Specifies dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).

split (str, defaults to None) : User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is a str type, will automatically select the best one via choose_split().

Returns:

data (Dataset)

Loaded dataset which will be used for evaluation.

predictions_processor[[evaluate.Evaluator.predictions_processor]]

Source

A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
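As an illustration only (this is not the library's implementation, and the helper below is a hypothetical sketch), a text-classification style processor would translate the label strings returned by the pipeline into the integer ids the metric expects, using the label_mapping passed to compute():

# Hypothetical sketch, not the library's implementation: pipeline outputs such as
# {"label": "POSITIVE", "score": 0.99} are mapped to integer ids via label_mapping.
def predictions_processor(predictions, label_mapping):
    return {"predictions": [label_mapping[pred["label"]] for pred in predictions]}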

prepare_data[[evaluate.Evaluator.prepare_data]]

Source

Prepare data.

Example:

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").prepare_data(ds, input_column="text", second_input_column=None, label_column="label")

Parameters:

data (Dataset) : Specifies the dataset we will run evaluation on.

input_column (str, defaults to "text") : The name of the column containing the text feature in the dataset specified by data.

second_input_column (str, optional) : The name of the column containing the second text feature if there is one. Otherwise, set to None.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

Returns:

dict

Metric inputs.

list

Pipeline inputs.
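For illustration, the two return values can be unpacked and forwarded to the metric and the pipeline respectively (a sketch, reusing the dataset ds from the example above and assuming the method returns them as a pair):

>>> metric_inputs, pipe_inputs = evaluator("text-classification").prepare_data(
>>>     ds, input_column="text", second_input_column=None, label_column="label"
>>> )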

prepare_metric[[evaluate.Evaluator.prepare_metric]]

Source

Prepare metric.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_metric("accuracy")

Parameters:

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

Returns:

The loaded metric.

prepare_pipeline[[evaluate.Evaluator.prepare_pipeline]]

Source

Prepare pipeline.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_pipeline(model_or_pipeline="distilbert-base-uncased")

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None) : Argument can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

Returns:

The initialized pipeline.
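Because model_or_pipeline also accepts a pre-initialized pipeline, an existing transformers pipeline can be passed through unchanged. A sketch, assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint:

>>> from transformers import pipeline
>>> from evaluate import evaluator
>>> pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
>>> pipe = evaluator("text-classification").prepare_pipeline(model_or_pipeline=pipe)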

The task specific evaluators

ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]

evaluate.ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]

Source

Image classification evaluator. This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification. Methods in this class assume a data format compatible with the ImageClassificationPipeline.

compute[[evaluate.ImageClassificationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="nateraw/vit-base-beans",
>>>     data=data,
>>>     label_column="labels",
>>>     metric="accuracy",
>>>     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
>>>     strategy="bootstrap"
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "image") : The name of the column containing the images as PIL ImageFile in the dataset specified by data.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

label_mapping (Dict[str, Number], optional, defaults to None) : A mapping from the class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
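For illustration, the results from the example above can be inspected key by key; with the "simple" strategy each value is a plain score, while with "bootstrap" it is a nested dict holding the score, the confidence interval and the standard error (the exact nested key names may differ between versions):

>>> for metric_key, value in results.items():
>>>     print(metric_key, value)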

QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]

evaluate.QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]

Source

Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.

This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.

Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.

compute[[evaluate.QuestionAnsweringEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
>>>     data=data,
>>>     metric="squad",
>>> )

Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
>>>     data=data,
>>>     metric="squad_v2",
>>>     squad_v2_format=True,
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

question_column (str, defaults to "question") : The name of the column containing the question in the dataset specified by data.

context_column (str, defaults to "context") : The name of the column containing the context in the dataset specified by data.

id_column (str, defaults to "id") : The name of the column containing the identification field of the question and answer pair in the dataset specified by data.

label_column (str, defaults to "answers") : The name of the column containing the answers in the dataset specified by data.

squad_v2_format (bool, optional, defaults to None) : Whether the dataset follows the format of the squad_v2 dataset. This is the case when the provided dataset has questions where the answer is not in the context, more specifically when answers appear as {"text": [], "answer_start": []} in the answer column. If all questions have at least one answer, this parameter should be set to False. If this parameter is not provided, the format will be automatically inferred.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.

TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]

evaluate.TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]

Source

Text classification evaluator. This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias. Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.

compute[[evaluate.TextClassificationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
>>>     data=data,
>>>     metric="accuracy",
>>>     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
>>>     strategy="bootstrap",
>>>     n_resamples=10,
>>>     random_state=0
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias - sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, optional, defaults to "text") : The name of the column containing the text feature in the dataset specified by data.

second_input_column (str, optional, defaults to None) : The name of the second column containing the text features. This may be useful for classification tasks such as MNLI, where two columns are used.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

label_mapping (Dict[str, Number], optional, defaults to None) : A mapping from the class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.

TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]

evaluate.TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]

Source

Token classification evaluator.

This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.

Methods in this class assume a data format compatible with the TokenClassificationPipeline.

compute[[evaluate.TokenClassificationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
>>>     data=data,
>>>     metric="seqeval",
>>> )

For example, the following dataset format is accepted by the evaluator:

dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
        }),
)

For example, the following dataset format is not accepted by the evaluator:

dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New York is a city and Felix a person."]],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "tokens") : The name of the column containing the tokens feature in the dataset specified by data.

label_column (str, defaults to "ner_tags") : The name of the column containing the labels in the dataset specified by data.

join_by (str, optional, defaults to " ") : This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join words to generate a string input. This is especially useful for languages that do not separate words by a space.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.

TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]

evaluate.TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]

Source

Text generation evaluator. This text generation evaluator can currently be loaded from evaluator() using the default task name text-generation. Methods in this class assume a data format compatible with the TextGenerationPipeline.

compute[[evaluate.TextGenerationEvaluator.compute]]

Source

Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]

evaluate.Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]

Source

Text2Text generation evaluator. This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name text2text-generation. Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.

compute[[evaluate.Text2TextGenerationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text2text-generation")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="facebook/bart-large-cnn",
>>>     data=data,
>>>     input_column="article",
>>>     label_column="highlights",
>>>     metric="rouge",
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "text") : The name of the column containing the input text in the dataset specified by data.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.

SummarizationEvaluator[[evaluate.SummarizationEvaluator]]

evaluate.SummarizationEvaluator[[evaluate.SummarizationEvaluator]]

Source

Text summarization evaluator. This text summarization evaluator can currently be loaded from evaluator() using the default task name summarization. Methods in this class assume a data format compatible with the SummarizationPipeline.

compute[[evaluate.SummarizationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="facebook/bart-large-cnn",
>>>     data=data,
>>>     input_column="article",
>>>     label_column="highlights",
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "text") : the name of the column containing the input text in the dataset specified by data.

label_column (str, defaults to "label") : the name of the column containing the labels in the dataset specified by data.

generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
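
As a complement to the example above, a minimal sketch of the generation_kwargs parameter; num_beams and max_length are standard transformers generation arguments assumed here purely for illustration.

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="facebook/bart-large-cnn",
>>>     data=data,
>>>     input_column="article",
>>>     label_column="highlights",
>>>     generation_kwargs={"num_beams": 4, "max_length": 128},  # forwarded to the pipeline's generation step
>>> )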

TranslationEvaluator[[evaluate.TranslationEvaluator]]

evaluate.TranslationEvaluator[[evaluate.TranslationEvaluator]]

Source

Translation evaluator. This translation generation evaluator can currently be loaded from evaluator() using the default task name translation. Methods in this class assume a data format compatible with the TranslationPipeline.

compute[[evaluate.TranslationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
>>>     data=data,
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "text") : the name of the column containing the input text in the dataset specified by data.

label_column (str, defaults to "label") : the name of the column containing the labels in the dataset specified by data.

generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
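
A minimal sketch of passing a pre-loaded metric module instead of a metric name; the choice of sacrebleu here is an assumption for illustration, and any EvaluationModule compatible with string predictions and references should work the same way.

>>> import evaluate
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> bleu = evaluate.load("sacrebleu")                 # pre-loaded metric module (assumed for illustration)
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
>>>     data=data,
>>>     metric=bleu,                                  # EvaluationModule instance rather than a name
>>> )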

AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]

evaluate.AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]

Source

Automatic speech recognition evaluator. This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name automatic-speech-recognition. Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.

compute[[evaluate.AutomaticSpeechRecognitionEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="https://huggingface.co/openai/whisper-tiny.en",
>>>     data=data,
>>>     input_column="path",
>>>     label_column="sentence",
>>>     metric="wer",
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "path") : the name of the column containing the input audio path in the dataset specified by data.

label_column (str, defaults to "sentence") : the name of the column containing the labels in the dataset specified by data.

generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
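
A minimal sketch of passing a pre-initialized pipeline as model_or_pipeline (in which case the tokenizer argument is ignored); the pipeline construction below mirrors the example above and is an assumption for illustration.

>>> from transformers import pipeline
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> # Build the pipeline yourself, e.g. to control its configuration, then hand it to compute().
>>> asr_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline=asr_pipe,   # pre-initialized pipeline instead of a model name
>>>     data=data,
>>>     input_column="path",
>>>     label_column="sentence",
>>>     metric="wer",
>>> )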

AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]

evaluate.AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]

Source

Audio classification evaluator. This audio classification evaluator can currently be loaded from evaluator() using the default task name audio-classification. Methods in this class assume a data format compatible with the transformers.AudioClassificationPipeline.

compute[[evaluate.AudioClassificationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline=""superb/wav2vec2-base-superb-ks"",
>>>     data=data,
>>>     label_column="label",
>>>     input_column="file",
>>>     metric="accuracy",
>>>     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )

The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> data = data.map(lambda example: {"audio": example["audio"]["array"]})
>>> results = task_evaluator.compute(
>>>     model_or_pipeline=""superb/wav2vec2-base-superb-ks"",
>>>     data=data,
>>>     label_column="label",
>>>     input_column="audio",
>>>     metric="accuracy",
>>>     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case audio-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "file") : The name of the column containing either the audio files or a raw waveform, represented as a numpy array, in the dataset specified by data.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

label_mapping (Dict[str, Number], optional, defaults to None) : We want to map class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
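
Because label_mapping translates the pipeline's string labels into the integer ids stored in label_column, one way to build it is from the dataset's ClassLabel feature. The snippet below is a sketch assuming the superb 'ks' setup from the examples above, where the model's output labels match the dataset's class names.

>>> from datasets import load_dataset
>>> data = load_dataset("superb", "ks", split="test[:40]")
>>> # "label" is a ClassLabel feature, so it exposes the string <-> integer id mapping.
>>> label_feature = data.features["label"]
>>> label_mapping = {name: label_feature.str2int(name) for name in label_feature.names}
>>> # e.g. {"yes": 0, "no": 1, "up": 2, "down": 3, ...}; pass this as label_mapping in compute()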
