# Evaluator
The evaluator classes for automatic evaluation.
## Evaluator classes[[evaluate.evaluator]]
The main entry point for using the evaluator:
#### evaluate.evaluator[[evaluate.evaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/__init__.py#L113)
Utility factory method to build an [Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator).
Evaluators encapsulate a task and a default metric name. They leverage `pipeline` functionality from `transformers`
to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.
Examples:
```python
>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")
```
**Parameters:**
task (`str`) : The task defining which evaluator will be returned. Currently accepted tasks are:
- `"image-classification"`: will return an [ImageClassificationEvaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.ImageClassificationEvaluator).
- `"question-answering"`: will return a [QuestionAnsweringEvaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.QuestionAnsweringEvaluator).
- `"text-classification"` (alias `"sentiment-analysis"`): will return a [TextClassificationEvaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.TextClassificationEvaluator).
- `"token-classification"`: will return a [TokenClassificationEvaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.TokenClassificationEvaluator).
**Returns:**
`[Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator)`
An evaluator suitable for the task.
The base class for all evaluator classes:
#### evaluate.Evaluator[[evaluate.Evaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L102)
The [Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator) class is the class from which all evaluators inherit. Refer to this class for methods shared across
different evaluators.
Base class implementing evaluator operations.
#### check_required_columns[[evaluate.Evaluator.check_required_columns]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L295)
Ensure the columns required for the evaluation are present in the dataset.
Example:
```py
>>> from datasets import load_dataset
>>> from evaluate import evaluator
>>> data = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").check_required_columns(data, {"input_column": "text", "label_column": "label"})
```
**Parameters:**
data (`str` or `Dataset`) : Specifies the dataset we will run evaluation on.
columns_names (`List[str]`) : List of column names to check in the dataset. The keys are the arguments to the [evaluate.EvaluationModule.compute()](/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) method, while the values are the column names to check.
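The check itself is simple to picture. The sketch below is a hypothetical re-implementation for illustration only, assuming the dataset exposes its columns as a list of names (as `datasets.Dataset.column_names` does); the real method lives in `evaluate/evaluator/base.py`.

```python
# Hypothetical sketch of the column check, not the actual implementation.
# The mapping's keys are compute() arguments, its values the dataset columns to verify.
def check_required_columns(column_names, columns_mapping):
    for compute_arg, column in columns_mapping.items():
        if column not in column_names:
            raise ValueError(
                f"Column '{column}' (needed for argument '{compute_arg}') "
                f"is not among the dataset columns {column_names}."
            )

# Passes silently: both mapped columns exist.
check_required_columns(["text", "label"], {"input_column": "text", "label_column": "label"})
```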
#### compute_metric[[evaluate.Evaluator.compute_metric]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L517)
Compute and return metrics.
#### get_dataset_split[[evaluate.Evaluator.get_dataset_split]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L321)
Infers which split to use if `None` is given.
Example:
```py
>>> from evaluate import evaluator
>>> evaluator("text-classification").get_dataset_split(data="rotten_tomatoes")
WARNING:evaluate.evaluator.base:Dataset split not defined! Automatically evaluating with split: TEST
'test'
```
**Parameters:**
data (`str`) : Name of dataset.
subset (`str`) : Name of config for datasets with multiple configurations (e.g. 'glue/cola').
split (`str`, defaults to `None`) : Split to use.
**Returns:**
`split`
A `str` containing which split to use.
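The inference can be pictured as a preference order over the available splits. The helper below is a hypothetical sketch, not the actual `choose_split` implementation, and the real preference order may differ; it only mirrors the behavior shown in the example above, where evaluation favors a held-out split.

```python
# Hypothetical sketch of split inference: when no split is given, prefer a
# held-out split, as the warning in the example above suggests.
def infer_split(available_splits, preference=("test", "validation", "train")):
    for split in preference:
        if split in available_splits:
            return split
    raise ValueError(f"No suitable split among {available_splits}.")
```

With this sketch, `infer_split(["train", "test"])` picks `"test"`, matching the logged warning above.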
#### load_data[[evaluate.Evaluator.load_data]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L350)
Load dataset with given subset and split.
Example:
```py
>>> from evaluate import evaluator
>>> evaluator("text-classification").load_data(data="rotten_tomatoes", split="train")
Dataset({
features: ['text', 'label'],
num_rows: 8530
})
```
**Parameters:**
data (`Dataset` or `str`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Specifies dataset subset to be passed to `name` in `load_dataset`. To be used with datasets with several configurations (e.g. glue/sst2).
split (`str`, defaults to `None`) : User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (`test[:n]`). If not defined and data is a `str` type, will automatically select the best one via `choose_split()`.
**Returns:**
data (`Dataset`)
The loaded dataset, which will be used for evaluation.
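The slice-split notation accepted by `split` can be illustrated with a small parser. This helper is hypothetical and for illustration only; the real parsing (which supports richer expressions) is handled inside `datasets`.

```python
import re

# Hypothetical parser for the "split[:n]" notation mentioned above; the real
# parsing lives inside the datasets library.
def parse_slice_split(split):
    m = re.fullmatch(r"(\w+)\[:(\d+)\]", split)
    if m:
        return m.group(1), int(m.group(2))  # base split name, row limit
    return split, None  # plain split name, no slicing

parse_slice_split("test[:40]")  # ("test", 40)
```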
#### predictions_processor[[evaluate.Evaluator.predictions_processor]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L211)
A core method of the `Evaluator` class, which processes the pipeline outputs for compatibility with the metric.
#### prepare_data[[evaluate.Evaluator.prepare_data]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L390)
Prepare data.
Example:
```py
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").prepare_data(ds, input_column="text", second_input_column=None, label_column="label")
```
**Parameters:**
data (`Dataset`) : Specifies the dataset we will run evaluation on.
input_column (`str`, defaults to `"text"`) : The name of the column containing the text feature in the dataset specified by `data`.
second_input_column (`str`, *optional*) : The name of the column containing the second text feature if there is one. Otherwise, set to `None`.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
**Returns:**
`dict`: the metric inputs.
`list`: the pipeline inputs.
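What the two return values contain can be sketched for the text-classification case. The structure below follows the column descriptions above, but the exact dict keys produced by the real method are an assumption here.

```python
# Hypothetical sketch of the split prepare_data performs for text classification:
# metric inputs (the references) as a dict, pipeline inputs (raw texts) as a list.
examples = [{"text": "great movie", "label": 1}, {"text": "dull plot", "label": 0}]
metric_inputs = {"references": [ex["label"] for ex in examples]}
pipe_inputs = [ex["text"] for ex in examples]
```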
#### prepare_metric[[evaluate.Evaluator.prepare_metric]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L480)
Prepare metric.
Example:
```py
>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_metric("accuracy")
```
**Parameters:**
metric (`str` or [EvaluationModule](/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule), defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
**Returns:**
The loaded metric.
#### prepare_pipeline[[evaluate.Evaluator.prepare_pipeline]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L422)
Prepare pipeline.
Example:
```py
>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_pipeline(model_or_pipeline="distilbert-base-uncased")
```
**Parameters:**
model_or_pipeline (`str` or [Pipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.Pipeline) or `Callable` or [PreTrainedModel](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel) or `TFPreTrainedModel`, defaults to `None`) : If the argument in not specified, we initialize the default pipeline for the task. If the argument is of the type `str` or is a model instance, we use it to initialize a new [Pipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.Pipeline) with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
preprocessor ([PreTrainedTokenizerBase](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase) or [FeatureExtractionMixin](https://huggingface.co/docs/transformers/main/en/main_classes/feature_extractor#transformers.FeatureExtractionMixin), *optional*, defaults to `None`) : Argument can be used to overwrite a default preprocessor if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
**Returns:**
The initialized pipeline.
## The task specific evaluators
### ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]
#### evaluate.ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/image_classification.py#L49)
Image classification evaluator.
This image classification evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`image-classification`.
Methods in this class assume a data format compatible with the `ImageClassificationPipeline`.
#### compute[[evaluate.ImageClassificationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/image_classification.py#L68)
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="nateraw/vit-base-beans",
>>> data=data,
>>> label_column="labels",
>>> metric="accuracy",
>>> label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
>>> strategy="bootstrap"
>>> )
```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `image-classification`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s [`bootstrap`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html) method.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"image"`) : The name of the column containing the images as PIL ImageFile in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
label_mapping (`Dict[str, Number]`, *optional*, defaults to `None`) : We want to map class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` of the `data` dataset.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
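The role of `label_mapping` can be illustrated in isolation. The sketch below is hypothetical and only shows the idea: pipeline predictions are strings, dataset labels are integers, and the mapping reconciles the two before the metric is computed.

```python
# Hypothetical illustration of label_mapping: the pipeline emits string labels,
# the dataset stores integer ids, and the mapping aligns them for the metric.
label_mapping = {"angular_leaf_spot": 0, "bean_rust": 1, "healthy": 2}
pipeline_outputs = [{"label": "healthy"}, {"label": "bean_rust"}]  # made-up outputs
predictions = [label_mapping[out["label"]] for out in pipeline_outputs]
```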
### QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]
#### evaluate.QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/question_answering.py#L75)
Question answering evaluator. This evaluator handles
[**extractive** question answering](https://huggingface.co/docs/transformers/task_summary#extractive-question-answering),
where the answer to the question is extracted from a context.
This question answering evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`question-answering`.
Methods in this class assume a data format compatible with the
`QuestionAnsweringPipeline`.
#### compute[[evaluate.QuestionAnsweringEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/question_answering.py#L144)
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
>>> data=data,
>>> metric="squad",
>>> )
```
> [!TIP]
> Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass `squad_v2_format=True` to
> the compute() call.
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
>>> data=data,
>>> metric="squad_v2",
>>> squad_v2_format=True,
>>> )
```
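The automatic inference of `squad_v2_format` hinges on spotting unanswerable examples. The helper below is a hypothetical sketch of that check, based on the `{"text": [], "answer_start": []}` convention for unanswerable questions; it is not the library's actual inference code.

```python
# Hypothetical sketch of squad_v2 format detection: an example is unanswerable
# when its "answers" field carries empty lists.
def looks_like_squad_v2(examples):
    return any(len(ex["answers"]["text"]) == 0 for ex in examples)

answerable = {"answers": {"text": ["Denver"], "answer_start": [12]}}     # made-up row
unanswerable = {"answers": {"text": [], "answer_start": []}}             # made-up row
```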
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `question-answering`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s [`bootstrap`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html) method.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
question_column (`str`, defaults to `"question"`) : The name of the column containing the question in the dataset specified by `data`.
context_column (`str`, defaults to `"context"`) : The name of the column containing the context in the dataset specified by `data`.
id_column (`str`, defaults to `"id"`) : The name of the column containing the identification field of the question and answer pair in the dataset specified by `data`.
label_column (`str`, defaults to `"answers"`) : The name of the column containing the answers in the dataset specified by `data`.
squad_v2_format (`bool`, *optional*, defaults to `None`) : Whether the dataset follows the format of the squad_v2 dataset. This is the case when the provided dataset has questions where the answer is not in the context, more specifically when the answers are given as `{"text": [], "answer_start": []}` in the answer column. If all questions have at least one answer, this parameter should be set to `False`. If this parameter is not provided, the format will be automatically inferred.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
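The `"bootstrap"` strategy delegates to `scipy.stats.bootstrap`, as noted in the `strategy` parameter above. The snippet below shows that call in isolation on made-up per-example scores, to give a feel for the confidence interval that ends up in the returned `Dict`; the exact way the evaluator wires scores into `bootstrap` is an assumption here.

```python
import numpy as np
from scipy.stats import bootstrap

# Made-up per-example correctness scores (1 = correct), for illustration only.
scores = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# The same scipy call the "bootstrap" strategy relies on: resample the scores
# and derive a confidence interval for their mean (here, the accuracy).
res = bootstrap(
    (scores,),
    np.mean,
    confidence_level=0.95,
    n_resamples=9999,
    random_state=0,
)
low, high = res.confidence_interval.low, res.confidence_interval.high
```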
### TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]
#### evaluate.TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text_classification.py#L51)
Text classification evaluator.
This text classification evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`text-classification` or with a `"sentiment-analysis"` alias.
Methods in this class assume a data format compatible with the [TextClassificationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TextClassificationPipeline) - a single textual
feature as input and a categorical label as output.
#### compute[[evaluate.TextClassificationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text_classification.py#L89)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`text-classification` or its alias `sentiment-analysis`). If the argument is of the type `str` or
is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the
argument specifies a pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) --
Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, *optional*, defaults to `"text"`) --
The name of the column containing the text feature in the dataset specified by `data`.
- **second_input_column** (`str`, *optional*, defaults to `None`) --
The name of the second column containing the text features. This may be useful for classification tasks
such as MNLI, where two columns are used.
- **label_column** (`str`, defaults to `"label"`) --
The name of the column containing the labels in the dataset specified by `data`.
- **label_mapping** (`Dict[str, Number]`, *optional*, defaults to `None`) --
Maps class labels defined by the model in the pipeline to values consistent with those
defined in the `label_column` of the `data` dataset.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `text-classification` or its alias `sentiment-analysis`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, *optional*, defaults to `"text"`) : The name of the column containing the text feature in the dataset specified by `data`.
second_input_column (`str`, *optional*, defaults to `None`) : The name of the second column containing the text features. This may be useful for classification tasks as MNLI, where two columns are used.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
label_mapping (`Dict[str, Number]`, *optional*, defaults to `None`) : Maps class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` of the `data` dataset.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
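The `"bootstrap"` strategy resamples the per-example scores with replacement and reports a confidence interval around the point estimate. As a rough illustration of what that means, here is a minimal percentile-bootstrap sketch in plain Python; the helper name `bootstrap_ci` is ours, and the evaluator itself delegates the real computation to `scipy.stats.bootstrap`:

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, confidence_level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1 - confidence_level) / 2
    low = means[int(alpha * n_resamples)]
    high = means[min(int((1 - alpha) * n_resamples), n_resamples - 1)]
    return statistics.fmean(scores), (low, high)

# 1 = correct prediction, 0 = incorrect; accuracy is the mean of these.
per_example = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
accuracy, (low, high) = bootstrap_ci(per_example)
print(accuracy)  # 0.7
print(low, high)
```

With small datasets (as in the two-example doctest above), the interval is extremely wide; bootstrap estimates only become informative with a reasonable number of evaluation examples.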
### TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]
#### evaluate.TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/token_classification.py#L84)
Token classification evaluator.
This token classification evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`token-classification`.
Methods in this class assume a data format compatible with the [TokenClassificationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TokenClassificationPipeline).
#### compute[[evaluate.TokenClassificationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/token_classification.py#L212)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`token-classification`). If the argument is of the type `str` or
is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the
argument specifies a pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) --
Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"tokens"`) --
The name of the column containing the tokens feature in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"ner_tags"`) --
The name of the column containing the labels in the dataset specified by `data`.
- **join_by** (`str`, *optional*, defaults to `" "`) --
This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join
words to generate a string input. This is especially useful for languages that do not separate words by a space.
Compute the metric for a given pipeline and dataset combination.
The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the [conll2003 dataset](https://huggingface.co/datasets/conll2003). Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )
```
> [!TIP]
> For example, the following dataset format is accepted by the evaluator:
>
> ```python
> dataset = Dataset.from_dict(
> mapping={
> "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
> "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
> },
> features=Features({
> "tokens": Sequence(feature=Value(dtype="string")),
> "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
> }),
> )
> ```
> [!WARNING]
> For example, the following dataset format is **not** accepted by the evaluator:
>
> ```python
> dataset = Dataset.from_dict(
> mapping={
> "tokens": [["New York is a city and Felix a person."]],
> "starts": [[0, 23]],
> "ends": [[7, 27]],
> "ner_tags": [["LOC", "PER"]],
> },
> features=Features({
> "tokens": Value(dtype="string"),
> "starts": Sequence(feature=Value(dtype="int32")),
> "ends": Sequence(feature=Value(dtype="int32")),
> "ner_tags": Sequence(feature=Value(dtype="string")),
> }),
> )
> ```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `token-classification`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"tokens"`) : The name of the column containing the tokens feature in the dataset specified by `data`.
label_column (`str`, defaults to `"ner_tags"`) : The name of the column containing the labels in the dataset specified by `data`.
join_by (`str`, *optional*, defaults to `" "`) : This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join words to generate a string input. This is especially useful for languages that do not separate words by a space.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
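Under the hood, the evaluator has to join the word list into a single string before feeding it to the pipeline, and then map the character spans of predicted entities back to word positions. A rough sketch of that joining step under the `join_by` convention (the helper name is ours, not the library's):

```python
def words_to_text_and_offsets(words, join_by=" "):
    """Join words into the pipeline input string, recording each word's start
    offset so entity character spans can be mapped back to word positions."""
    text = join_by.join(words)
    offsets, pos = [], 0
    for word in words:
        offsets.append(pos)
        pos += len(word) + len(join_by)
    return text, offsets

text, starts = words_to_text_and_offsets(["New", "York", "is", "a", "city"])
print(text)    # New York is a city
print(starts)  # [0, 4, 9, 12, 14]
```

Passing `join_by=""` handles languages that do not separate words with spaces, since the offsets then advance by word length alone.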
### TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]
#### evaluate.TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text_generation.py#L31)
Text generation evaluator.
This Text generation evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`text-generation`.
Methods in this class assume a data format compatible with the [TextGenerationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TextGenerationPipeline).
#### compute[[evaluate.TextGenerationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L218)
### Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]
#### evaluate.Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L88)
Text2Text generation evaluator.
This Text2Text generation evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`text2text-generation`.
Methods in this class assume a data format compatible with the `Text2TextGenerationPipeline`.
#### compute[[evaluate.Text2TextGenerationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L105)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`text2text-generation`). If the argument is of the type `str` or
is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the
argument specifies a pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) --
Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"text"`) --
The name of the column containing the input text in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"label"`) --
The name of the column containing the labels in the dataset specified by `data`.
- **generation_kwargs** (`Dict`, *optional*, defaults to `None`) --
The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text2text-generation")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="facebook/bart-large-cnn",
...     data=data,
...     input_column="article",
...     label_column="highlights",
...     metric="rouge",
... )
```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `text2text-generation`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"text"`) : The name of the column containing the input text in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
generation_kwargs (`Dict`, *optional*, defaults to `None`) : The generation kwargs are passed to the pipeline and set the text generation strategy.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
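The `generation_kwargs` are simply forwarded to the underlying pipeline call, where they control decoding (for example `max_new_tokens` or `num_beams`). A toy stand-in for the pipeline makes this plumbing visible; `fake_pipeline` below is a hypothetical placeholder, not the transformers API:

```python
def fake_pipeline(inputs, **generation_kwargs):
    # Stand-in for a Text2TextGenerationPipeline: a real pipeline would
    # forward these kwargs to model.generate() when producing each output.
    return [{"generated_text": text.upper()} for text in inputs], generation_kwargs

outputs, seen_kwargs = fake_pipeline(["short article"], max_new_tokens=64, num_beams=4)
print(outputs[0]["generated_text"])  # SHORT ARTICLE
print(seen_kwargs)                   # {'max_new_tokens': 64, 'num_beams': 4}
```

In practice you would pass `generation_kwargs={"max_new_tokens": 64, "num_beams": 4}` to `compute`, and the evaluator hands the dict to the pipeline unchanged.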
### SummarizationEvaluator[[evaluate.SummarizationEvaluator]]
#### evaluate.SummarizationEvaluator[[evaluate.SummarizationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L152)
Text summarization evaluator.
This text summarization evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`summarization`.
Methods in this class assume a data format compatible with the [SummarizationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.SummarizationPipeline).
#### compute[[evaluate.SummarizationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L166)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`summarization`). If the argument is of the type `str` or
is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the
argument specifies a pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to "simple") --
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"text"`) --
the name of the column containing the input text in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"label"`) --
the name of the column containing the labels in the dataset specified by `data`.
- **generation_kwargs** (`Dict`, *optional*, defaults to `None`) --
The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="facebook/bart-large-cnn",
>>> data=data,
>>> input_column="article",
>>> label_column="highlights",
>>> )
```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `summarization`). If the argument is of type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to "simple") : specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"text"`) : the name of the column containing the input text in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : the name of the column containing the labels in the dataset specified by `data`.
generation_kwargs (`Dict`, *optional*, defaults to `None`) : The generation kwargs are passed to the pipeline and set the text generation strategy.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
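The `"bootstrap"` strategy above resamples the per-example scores and reports a confidence interval for each metric key. A minimal stdlib sketch of percentile bootstrapping can illustrate what that entails (the evaluator itself delegates to `scipy.stats.bootstrap`; the helper name and percentile method here are illustrative, not the library's implementation):

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=999, confidence_level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    # Resample the scores with replacement and collect the resampled means.
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence_level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[int((1.0 - alpha) * n_resamples) - 1]
    return {"score": statistics.mean(scores), "confidence_interval": (lo, hi)}

result = bootstrap_ci([0.8, 0.6, 0.9, 0.7, 0.85, 0.75])
```

The `confidence_level` and `n_resamples` arguments of `compute` are forwarded to the real bootstrap routine in exactly this role.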
### TranslationEvaluator[[evaluate.TranslationEvaluator]]
#### evaluate.TranslationEvaluator[[evaluate.TranslationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L211)
Translation evaluator.
This translation generation evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`translation`.
Methods in this class assume a data format compatible with the `TranslationPipeline`.
#### compute[[evaluate.TranslationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L225)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`translation`). If the argument is of type `str` or is a model instance, we use it to initialize
a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized
pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to "simple") --
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"text"`) --
the name of the column containing the input text in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"label"`) --
the name of the column containing the labels in the dataset specified by `data`.
- **generation_kwargs** (`Dict`, *optional*, defaults to `None`) --
The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
>>> model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
>>> data=data,
>>> )
```
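The `data.map` line in the example above flattens `wmt19`'s nested `translation` field into the `text`/`label` columns the evaluator expects. With plain dictionaries the transformation looks like this (a stdlib sketch; dataset rows have the same shape):

```python
def flatten_translation(example, src="de", tgt="fr"):
    """Mirror of the data.map lambda: pull source/target strings out of the nested field."""
    return {"text": example["translation"][src], "label": example["translation"][tgt]}

# A single wmt19 "fr-de" row has a nested dict keyed by language code.
row = {"translation": {"de": "Guten Morgen", "fr": "Bonjour"}}
flat = flatten_translation(row)
# flat == {"text": "Guten Morgen", "label": "Bonjour"}
```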
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `translation`). If the argument is of type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to "simple") : specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"text"`) : the name of the column containing the input text in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : the name of the column containing the labels in the dataset specified by `data`.
generation_kwargs (`Dict`, *optional*, defaults to `None`) : The generation kwargs are passed to the pipeline and set the text generation strategy.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
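When `split=None`, the evaluator picks a split via the `choose_split` helper mentioned in the parameter list. A hypothetical sketch of that kind of preference-ordered selection (the function body and the actual preference order used by `evaluate` are assumptions here):

```python
def choose_split(available, preference=("test", "validation", "train")):
    """Hypothetical sketch: return the first preferred split present in the dataset."""
    for name in preference:
        if name in available:
            return name
    raise ValueError(f"No usable split found among {list(available)}")

split = choose_split(["train", "validation"])
```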
### AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]
#### evaluate.AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/automatic_speech_recognition.py#L47)
Automatic speech recognition evaluator.
This automatic speech recognition evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`automatic-speech-recognition`.
Methods in this class assume a data format compatible with the `AutomaticSpeechRecognitionPipeline`.
#### compute[[evaluate.AutomaticSpeechRecognitionEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/automatic_speech_recognition.py#L63)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`automatic-speech-recognition`). If the argument is of type `str` or is a model instance, we use it
to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a
pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to "simple") --
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"path"`) --
the name of the column containing the input audio path in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"sentence"`) --
the name of the column containing the labels in the dataset specified by `data`.
- **generation_kwargs** (`Dict`, *optional*, defaults to `None`) --
The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="openai/whisper-tiny.en",
>>> data=data,
>>> input_column="path",
>>> label_column="sentence",
>>> metric="wer",
>>> )
```
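`metric="wer"` in the example above loads the word error rate module. Conceptually, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words; a minimal stdlib sketch of that computation (the real metric comes from `evaluate.load("wer")`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over word tokens.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # d[j]+1: deletion, d[j-1]+1: insertion, prev+cost: substitution/match.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)
```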
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `automatic-speech-recognition`). If the argument is of type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to "simple") : specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"path"`) : the name of the column containing the input audio path in the dataset specified by `data`.
label_column (`str`, defaults to `"sentence"`) : the name of the column containing the labels in the dataset specified by `data`.
generation_kwargs (`Dict`, *optional*, defaults to `None`) : The generation kwargs are passed to the pipeline and set the text generation strategy.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
### AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]
#### evaluate.AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/audio_classification.py#L75)
Audio classification evaluator.
This audio classification evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`audio-classification`.
Methods in this class assume a data format compatible with the [transformers.AudioClassificationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.AudioClassificationPipeline).
#### compute[[evaluate.AudioClassificationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/audio_classification.py#L94)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`audio-classification`). If the argument is of type `str` or is a model instance, we use it to
initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a
pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **feature_extractor** (`str` or `FeatureExtractionMixin`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default feature extractor if `model_or_pipeline` represents a model
for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to "simple") --
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"file"`) --
The name of the column containing either the audio files or a raw waveform, represented as a numpy array, in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"label"`) --
The name of the column containing the labels in the dataset specified by `data`.
- **label_mapping** (`Dict[str, Number]`, *optional*, defaults to `None`) --
We want to map class labels defined by the model in the pipeline to values consistent with those
defined in the `label_column` of the `data` dataset.
Compute the metric for a given pipeline and dataset combination.
Examples:
> [!TIP]
> Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="superb/wav2vec2-base-superb-ks",
>>> data=data,
>>> label_column="label",
>>> input_column="file",
>>> metric="accuracy",
>>> label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )
```
> [!TIP]
> The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing
> the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> data = data.map(lambda example: {"audio": example["audio"]["array"]})
>>> results = task_evaluator.compute(
>>> model_or_pipeline="superb/wav2vec2-base-superb-ks",
>>> data=data,
>>> label_column="label",
>>> input_column="audio",
>>> metric="accuracy",
>>> label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )
```
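`label_mapping` bridges the string labels the pipeline emits and the integer class ids stored in the dataset's `label_column`, so predictions and references are comparable before the metric is computed. A stdlib sketch of that conversion (the exact label ids for the `superb` ks classes are an assumption here):

```python
def map_predictions(pipeline_labels, label_mapping):
    """Convert pipeline string predictions to dataset label ids before scoring."""
    return [label_mapping[lab] for lab in pipeline_labels]

# Hypothetical mapping: model label string -> dataset class id.
label_mapping = {"yes": 0, "no": 1, "up": 2, "down": 3}
preds = map_predictions(["yes", "down", "no"], label_mapping)
refs = [0, 3, 1]
accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
```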
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `audio-classification`). If the argument is of type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"file"`) : The name of the column containing either the audio files or a raw waveform, represented as a numpy array, in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
label_mapping (`Dict[str, Number]`, *optional*, defaults to `None`) : Used to map the class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` of the `data` dataset.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
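Under the `"bootstrap"` strategy, the confidence interval and standard error reported for each metric key come from `scipy.stats.bootstrap` applied to per-example scores. A minimal sketch of that computation, using an illustrative hand-made list of per-example accuracy indicators (not real evaluator output):

```python
import numpy as np
from scipy.stats import bootstrap

# Illustrative per-example correctness indicators (1 = prediction matched the label).
per_example = np.array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1])

# Point estimate, as returned under the "simple" strategy.
score = per_example.mean()

# Resampled confidence interval and standard error, as added under "bootstrap".
res = bootstrap(
    (per_example,),
    np.mean,
    confidence_level=0.95,
    n_resamples=9999,
    random_state=0,
)

print(score)
print(res.confidence_interval)
print(res.standard_error)
```

The evaluator assembles these pieces into the returned `Dict`, so each metric key maps to the score, the confidence interval bounds, and the standard error.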
