Evaluator

The evaluator classes for automatic evaluation.

Evaluator classes[[evaluate.evaluator]]

The main entry point for using the evaluator:

evaluate.evaluator[[evaluate.evaluator]]

Source

Utility factory method to build an Evaluator. Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.

Examples:

>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")

Parameters:

task (str) : The task defining which evaluator will be returned. Currently accepted tasks are:
- "image-classification": will return an ImageClassificationEvaluator.
- "question-answering": will return a QuestionAnsweringEvaluator.
- "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
- "token-classification": will return a TokenClassificationEvaluator.

Returns:

[Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator)

An evaluator suitable for the task.
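In practice, the returned evaluator is used through its compute() method, documented below. A minimal sketch, assuming the rotten_tomatoes dataset, the accuracy metric and the distilbert-base-uncased-finetuned-sst-2-english checkpoint are available:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("rotten_tomatoes", split="test[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
>>>     data=data,
>>>     metric="accuracy",
>>>     label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
>>> )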

The base class for all evaluator classes:

evaluate.Evaluator[[evaluate.Evaluator]]

Source

The Evaluator class is the base class from which all evaluators inherit and implements the operations they share. Refer to this class for methods common to the different evaluators.

check_required_columns[[evaluate.Evaluator.check_required_columns]]

Source

Ensure the columns required for the evaluation are present in the dataset.

Example:

>>> from datasets import load_dataset
>>> from evaluate import evaluator
>>> data = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").check_required_columns(data, {"input_column": "text", "label_column": "label"})

Parameters:

data (str or Dataset) : Specifies the dataset we will run evaluation on.

columns_names (Dict[str, str]) : Dictionary of column names to check in the dataset. The keys are the arguments to the evaluate.EvaluationModule.compute() method, while the values are the column names to check.

compute_metric[[evaluate.Evaluator.compute_metric]]

Source

Compute and return metrics.

get_dataset_split[[evaluate.Evaluator.get_dataset_split]]

Source

Infers which split to use if None is given.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").get_dataset_split(data="rotten_tomatoes")
WARNING:evaluate.evaluator.base:Dataset split not defined! Automatically evaluating with split: TEST
'test'

Parameters:

data (str) : Name of dataset.

subset (str) : Name of config for datasets with multiple configurations (e.g. 'glue/cola').

split (str, defaults to None) : Split to use.

Returns:

split

str containing which split to use

load_data[[evaluate.Evaluator.load_data]]

Source

Load dataset with given subset and split.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").load_data(data="rotten_tomatoes", split="train")
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

Parameters:

data (Dataset or str, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Specifies dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).

split (str, defaults to None) : User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is a str type, will automatically select the best one via choose_split().

Returns:

data (Dataset)

Loaded dataset which will be used for evaluation.

predictions_processor[[evaluate.Evaluator.predictions_processor]]

Source

A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
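As an illustration only (this is not the library's implementation, and the helper below is a hypothetical sketch), a text-classification style processor would translate the label strings returned by the pipeline into the integer ids the metric expects, using the label_mapping passed to compute():

# Hypothetical sketch, not the library's implementation: pipeline outputs such as
# {"label": "POSITIVE", "score": 0.99} are mapped to integer ids via label_mapping.
def predictions_processor(predictions, label_mapping):
    return {"predictions": [label_mapping[pred["label"]] for pred in predictions]}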

prepare_data[[evaluate.Evaluator.prepare_data]]

Source

Prepare data.

Example:

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").prepare_data(ds, input_column="text", second_input_column=None, label_column="label")

Parameters:

data (Dataset) : Specifies the dataset we will run evaluation on.

input_column (str, defaults to "text") : The name of the column containing the text feature in the dataset specified by data.

second_input_column (str, optional) : The name of the column containing the second text feature if there is one. Otherwise, set to None.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

Returns:

dict

Metric inputs.

list

Pipeline inputs.
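For illustration, the two return values can be unpacked and forwarded to the metric and the pipeline respectively (a sketch, reusing the dataset ds from the example above and assuming the method returns them as a pair):

>>> metric_inputs, pipe_inputs = evaluator("text-classification").prepare_data(
>>>     ds, input_column="text", second_input_column=None, label_column="label"
>>> )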

prepare_metric[[evaluate.Evaluator.prepare_metric]]

Source

Prepare metric.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_metric("accuracy")

Parameters:

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

Returns:

The loaded metric.

prepare_pipeline[[evaluate.Evaluator.prepare_pipeline]]

Source

Prepare pipeline.

Example:

>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_pipeline(model_or_pipeline="distilbert-base-uncased")

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None) : Argument can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

Returns:

The initialized pipeline.
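Because model_or_pipeline also accepts a pre-initialized pipeline, an existing transformers pipeline can be passed through unchanged. A sketch, assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint:

>>> from transformers import pipeline
>>> from evaluate import evaluator
>>> pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
>>> pipe = evaluator("text-classification").prepare_pipeline(model_or_pipeline=pipe)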

The task specific evaluators

ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]

evaluate.ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]

Source

Image classification evaluator. This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification. Methods in this class assume a data format compatible with the ImageClassificationPipeline.

compute[[evaluate.ImageClassificationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="nateraw/vit-base-beans",
>>>     data=data,
>>>     label_column="labels",
>>>     metric="accuracy",
>>>     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
>>>     strategy="bootstrap"
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "image") : The name of the column containing the images as PIL ImageFile in the dataset specified by data.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

label_mapping (Dict[str, Number], optional, defaults to None) : A mapping from the class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
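For illustration, the results from the example above can be inspected key by key; with the "simple" strategy each value is a plain score, while with "bootstrap" it is a nested dict holding the score, the confidence interval and the standard error (the exact nested key names may differ between versions):

>>> for metric_key, value in results.items():
>>>     print(metric_key, value)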

QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]

evaluate.QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]

Source

Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.

This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.

Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.

compute[[evaluate.QuestionAnsweringEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
>>>     data=data,
>>>     metric="squad",
>>> )

Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
>>>     data=data,
>>>     metric="squad_v2",
>>>     squad_v2_format=True,
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

question_column (str, defaults to "question") : The name of the column containing the question in the dataset specified by data.

context_column (str, defaults to "context") : The name of the column containing the context in the dataset specified by data.

id_column (str, defaults to "id") : The name of the column containing the identification field of the question and answer pair in the dataset specified by data.

label_column (str, defaults to "answers") : The name of the column containing the answers in the dataset specified by data.

squad_v2_format (bool, optional, defaults to None) : Whether the dataset follows the format of the squad_v2 dataset. This is the case when the provided dataset has questions where the answer is not in the context, more specifically when answers appear as {"text": [], "answer_start": []} in the answer column. If all questions have at least one answer, this parameter should be set to False. If this parameter is not provided, the format will be automatically inferred.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.

TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]

evaluate.TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]

Source

Text classification evaluator. This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias. Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.

compute[[evaluate.TextClassificationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
>>>     data=data,
>>>     metric="accuracy",
>>>     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
>>>     strategy="bootstrap",
>>>     n_resamples=10,
>>>     random_state=0
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias - sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, optional, defaults to "text") : The name of the column containing the text feature in the dataset specified by data.

second_input_column (str, optional, defaults to None) : The name of the second column containing the text features. This may be useful for classification tasks such as MNLI, where two columns are used.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

label_mapping (Dict[str, Number], optional, defaults to None) : A mapping from the class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.

TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]

evaluate.TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]

Source

Token classification evaluator.

This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.

Methods in this class assume a data format compatible with the TokenClassificationPipeline.

compute[[evaluate.TokenClassificationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
>>>     data=data,
>>>     metric="seqeval",
>>> )

For example, the following dataset format is accepted by the evaluator:

dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
        }),
)

For example, the following dataset format is not accepted by the evaluator:

dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New York is a city and Felix a person."]],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "tokens") : The name of the column containing the tokens feature in the dataset specified by data.

label_column (str, defaults to "ner_tags") : The name of the column containing the labels in the dataset specified by data.

join_by (str, optional, defaults to " ") : This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join words to generate a string input. This is especially useful for languages that do not separate words by a space.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.

TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]

evaluate.TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]

Source

Text generation evaluator. This text generation evaluator can currently be loaded from evaluator() using the default task name text-generation. Methods in this class assume a data format compatible with the TextGenerationPipeline.

compute[[evaluate.TextGenerationEvaluator.compute]]

Source

Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]

evaluate.Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]

Source

Text2Text generation evaluator. This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name text2text-generation. Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.

compute[[evaluate.Text2TextGenerationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text2text-generation")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="facebook/bart-large-cnn",
>>>     data=data,
>>>     input_column="article",
>>>     label_column="highlights",
>>>     metric="rouge",
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple": we evaluate the metric and return the scores. - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "text") : The name of the column containing the input text in the dataset specified by data.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.

SummarizationEvaluator[[evaluate.SummarizationEvaluator]]

evaluate.SummarizationEvaluator[[evaluate.SummarizationEvaluator]]

Source

Text summarization evaluator. This text summarization evaluator can currently be loaded from evaluator() using the default task name summarization. Methods in this class assume a data format compatible with the SummarizationPipeline.

compute[[evaluate.SummarizationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="facebook/bart-large-cnn",
>>>     data=data,
>>>     input_column="article",
>>>     label_column="highlights",
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "text") : the name of the column containing the input text in the dataset specified by data.

label_column (str, defaults to "label") : the name of the column containing the labels in the dataset specified by data.

generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
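
As a complement to the example above, a minimal sketch of the generation_kwargs parameter; num_beams and max_length are standard transformers generation arguments assumed here purely for illustration.

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="facebook/bart-large-cnn",
>>>     data=data,
>>>     input_column="article",
>>>     label_column="highlights",
>>>     generation_kwargs={"num_beams": 4, "max_length": 128},  # forwarded to the pipeline's generation step
>>> )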

TranslationEvaluator[[evaluate.TranslationEvaluator]]

evaluate.TranslationEvaluator[[evaluate.TranslationEvaluator]]

Source

Translation evaluator. This translation generation evaluator can currently be loaded from evaluator() using the default task name translation. Methods in this class assume a data format compatible with the TranslationPipeline.

compute[[evaluate.TranslationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
>>>     data=data,
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "text") : the name of the column containing the input text in the dataset specified by data.

label_column (str, defaults to "label") : the name of the column containing the labels in the dataset specified by data.

generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
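
A minimal sketch of passing a pre-loaded metric module instead of a metric name; the choice of sacrebleu here is an assumption for illustration, and any EvaluationModule compatible with string predictions and references should work the same way.

>>> import evaluate
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> bleu = evaluate.load("sacrebleu")                 # pre-loaded metric module (assumed for illustration)
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
>>>     data=data,
>>>     metric=bleu,                                  # EvaluationModule instance rather than a name
>>> )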

AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]

evaluate.AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]

Source

Automatic speech recognition evaluator. This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name automatic-speech-recognition. Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.

compute[[evaluate.AutomaticSpeechRecognitionEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline="https://huggingface.co/openai/whisper-tiny.en",
>>>     data=data,
>>>     input_column="path",
>>>     label_column="sentence",
>>>     metric="wer",
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "path") : the name of the column containing the input audio path in the dataset specified by data.

label_column (str, defaults to "sentence") : the name of the column containing the labels in the dataset specified by data.

generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
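
A minimal sketch of passing a pre-initialized pipeline as model_or_pipeline (in which case the tokenizer argument is ignored); the pipeline construction below mirrors the example above and is an assumption for illustration.

>>> from transformers import pipeline
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> # Build the pipeline yourself, e.g. to control its configuration, then hand it to compute().
>>> asr_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline=asr_pipe,   # pre-initialized pipeline instead of a model name
>>>     data=data,
>>>     input_column="path",
>>>     label_column="sentence",
>>>     metric="wer",
>>> )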

AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]

evaluate.AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]

Source

Audio classification evaluator. This audio classification evaluator can currently be loaded from evaluator() using the default task name audio-classification. Methods in this class assume a data format compatible with the transformers.AudioClassificationPipeline.

compute[[evaluate.AudioClassificationEvaluator.compute]]

Source

Compute the metric for a given pipeline and dataset combination.

Examples:

Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> results = task_evaluator.compute(
>>>     model_or_pipeline=""superb/wav2vec2-base-superb-ks"",
>>>     data=data,
>>>     label_column="label",
>>>     input_column="file",
>>>     metric="accuracy",
>>>     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )

The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.

>>> from evaluate import evaluator
>>> from datasets import load_dataset

>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> data = data.map(lambda example: {"audio": example["audio"]["array"]})
>>> results = task_evaluator.compute(
>>>     model_or_pipeline=""superb/wav2vec2-base-superb-ks"",
>>>     data=data,
>>>     label_column="label",
>>>     input_column="audio",
>>>     metric="accuracy",
>>>     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )

Parameters:

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case audio-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.

data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.

subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.

split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.

metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

strategy (Literal["simple", "bootstrap"], defaults to "simple") : specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.

confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.

n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.

device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.

random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

input_column (str, defaults to "file") : The name of the column containing either the audio files or a raw waveform, represented as a numpy array, in the dataset specified by data.

label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.

label_mapping (Dict[str, Number], optional, defaults to None) : We want to map class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Returns:

A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
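
Because label_mapping translates the pipeline's string labels into the integer ids stored in label_column, one way to build it is from the dataset's ClassLabel feature. The snippet below is a sketch assuming the superb 'ks' setup from the examples above, where the model's output labels match the dataset's class names.

>>> from datasets import load_dataset
>>> data = load_dataset("superb", "ks", split="test[:40]")
>>> # "label" is a ClassLabel feature, so it exposes the string <-> integer id mapping.
>>> label_feature = data.features["label"]
>>> label_mapping = {name: label_feature.str2int(name) for name in label_feature.names}
>>> # e.g. {"yes": 0, "no": 1, "up": 2, "down": 3, ...}; pass this as label_mapping in compute()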
