# Evaluator
The evaluator classes for automatic evaluation.
## Evaluator classes[[evaluate.evaluator]]
The main entry point for using the evaluator:
#### evaluate.evaluator[[evaluate.evaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/__init__.py#L113)
Utility factory method to build an [Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator).
Evaluators encapsulate a task and a default metric name. They leverage `pipeline` functionality from `transformers`
to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.
Examples:
```python
>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")
```
**Parameters:**
task (`str`) : The task defining which evaluator will be returned. Currently accepted tasks are:
- `"image-classification"`: will return an [ImageClassificationEvaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.ImageClassificationEvaluator).
- `"question-answering"`: will return a [QuestionAnsweringEvaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.QuestionAnsweringEvaluator).
- `"text-classification"` (alias `"sentiment-analysis"`): will return a [TextClassificationEvaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.TextClassificationEvaluator).
- `"token-classification"`: will return a [TokenClassificationEvaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.TokenClassificationEvaluator).
**Returns:**
`[Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator)`
An evaluator suitable for the task.
The base class for all evaluator classes:
#### evaluate.Evaluator[[evaluate.Evaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L102)
The [Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator) class is the class from which all evaluators inherit. Refer to this class for methods shared across
different evaluators.
Base class implementing evaluator operations.
#### check_required_columns[[evaluate.Evaluator.check_required_columns]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L295)
Ensure the columns required for the evaluation are present in the dataset.
Example:
```py
>>> from datasets import load_dataset
>>> from evaluate import evaluator
>>> data = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").check_required_columns(data, {"input_column": "text", "label_column": "label"})
```
**Parameters:**
data (`str` or `Dataset`) : Specifies the dataset we will run evaluation on.
columns_names (`List[str]`) : List of column names to check in the dataset. The keys are the arguments to the [evaluate.EvaluationModule.compute()](/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) method, while the values are the column names to check.
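The check itself is simple to picture. The sketch below is a hypothetical re-implementation for illustration only, assuming the dataset exposes its columns as a list of names (as `datasets.Dataset.column_names` does); the real method lives in `evaluate/evaluator/base.py`.

```python
# Hypothetical sketch of the column check, not the actual implementation.
# The mapping's keys are compute() arguments, its values the dataset columns to verify.
def check_required_columns(column_names, columns_mapping):
    for compute_arg, column in columns_mapping.items():
        if column not in column_names:
            raise ValueError(
                f"Column '{column}' (needed for argument '{compute_arg}') "
                f"is not among the dataset columns {column_names}."
            )

# Passes silently: both mapped columns exist.
check_required_columns(["text", "label"], {"input_column": "text", "label_column": "label"})
```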
#### compute_metric[[evaluate.Evaluator.compute_metric]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L517)
Compute and return metrics.
#### get_dataset_split[[evaluate.Evaluator.get_dataset_split]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L321)
Infers which split to use if `None` is given.
Example:
```py
>>> from evaluate import evaluator
>>> evaluator("text-classification").get_dataset_split(data="rotten_tomatoes")
WARNING:evaluate.evaluator.base:Dataset split not defined! Automatically evaluating with split: TEST
'test'
```
**Parameters:**
data (`str`) : Name of dataset.
subset (`str`) : Name of config for datasets with multiple configurations (e.g. 'glue/cola').
split (`str`, defaults to `None`) : Split to use.
**Returns:**
`split`
A `str` containing which split to use.
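The inference can be pictured as a preference order over the available splits. The helper below is a hypothetical sketch, not the actual `choose_split` implementation, and the real preference order may differ; it only mirrors the behavior shown in the example above, where evaluation favors a held-out split.

```python
# Hypothetical sketch of split inference: when no split is given, prefer a
# held-out split, as the warning in the example above suggests.
def infer_split(available_splits, preference=("test", "validation", "train")):
    for split in preference:
        if split in available_splits:
            return split
    raise ValueError(f"No suitable split among {available_splits}.")
```

With this sketch, `infer_split(["train", "test"])` picks `"test"`, matching the logged warning above.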
#### load_data[[evaluate.Evaluator.load_data]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L350)
Load dataset with given subset and split.
Example:
```py
>>> from evaluate import evaluator
>>> evaluator("text-classification").load_data(data="rotten_tomatoes", split="train")
Dataset({
features: ['text', 'label'],
num_rows: 8530
})
```
**Parameters:**
data (`Dataset` or `str`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Specifies dataset subset to be passed to `name` in `load_dataset`. To be used with datasets with several configurations (e.g. glue/sst2).
split (`str`, defaults to `None`) : User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (`test[:n]`). If not defined and data is a `str` type, will automatically select the best one via `choose_split()`.
**Returns:**
data (`Dataset`)
The loaded dataset, which will be used for evaluation.
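The slice-split notation accepted by `split` can be illustrated with a small parser. This helper is hypothetical and for illustration only; the real parsing (which supports richer expressions) is handled inside `datasets`.

```python
import re

# Hypothetical parser for the "split[:n]" notation mentioned above; the real
# parsing lives inside the datasets library.
def parse_slice_split(split):
    m = re.fullmatch(r"(\w+)\[:(\d+)\]", split)
    if m:
        return m.group(1), int(m.group(2))  # base split name, row limit
    return split, None  # plain split name, no slicing

parse_slice_split("test[:40]")  # ("test", 40)
```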
#### predictions_processor[[evaluate.Evaluator.predictions_processor]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L211)
A core method of the `Evaluator` class, which processes the pipeline outputs for compatibility with the metric.
#### prepare_data[[evaluate.Evaluator.prepare_data]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L390)
Prepare data.
Example:
```py
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").prepare_data(ds, input_column="text", second_input_column=None, label_column="label")
```
**Parameters:**
data (`Dataset`) : Specifies the dataset we will run evaluation on.
input_column (`str`, defaults to `"text"`) : The name of the column containing the text feature in the dataset specified by `data`.
second_input_column (`str`, *optional*) : The name of the column containing the second text feature if there is one. Otherwise, set to `None`.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
**Returns:**
`dict`: the metric inputs.
`list`: the pipeline inputs.
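What the two return values contain can be sketched for the text-classification case. The structure below follows the column descriptions above, but the exact dict keys produced by the real method are an assumption here.

```python
# Hypothetical sketch of the split prepare_data performs for text classification:
# metric inputs (the references) as a dict, pipeline inputs (raw texts) as a list.
examples = [{"text": "great movie", "label": 1}, {"text": "dull plot", "label": 0}]
metric_inputs = {"references": [ex["label"] for ex in examples]}
pipe_inputs = [ex["text"] for ex in examples]
```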
#### prepare_metric[[evaluate.Evaluator.prepare_metric]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L480)
Prepare metric.
Example:
```py
>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_metric("accuracy")
```
**Parameters:**
metric (`str` or [EvaluationModule](/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule), defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
**Returns:**
The loaded metric.
#### prepare_pipeline[[evaluate.Evaluator.prepare_pipeline]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L422)
Prepare pipeline.
Example:
```py
>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_pipeline(model_or_pipeline="distilbert-base-uncased")
```
**Parameters:**
model_or_pipeline (`str` or [Pipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.Pipeline) or `Callable` or [PreTrainedModel](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel) or `TFPreTrainedModel`, defaults to `None`) : If the argument in not specified, we initialize the default pipeline for the task. If the argument is of the type `str` or is a model instance, we use it to initialize a new [Pipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.Pipeline) with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
preprocessor ([PreTrainedTokenizerBase](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase) or [FeatureExtractionMixin](https://huggingface.co/docs/transformers/main/en/main_classes/feature_extractor#transformers.FeatureExtractionMixin), *optional*, defaults to `None`) : Argument can be used to overwrite a default preprocessor if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
**Returns:**
The initialized pipeline.
## The task specific evaluators
### ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]
#### evaluate.ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/image_classification.py#L49)
Image classification evaluator.
This image classification evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`image-classification`.
Methods in this class assume a data format compatible with the `ImageClassificationPipeline`.
#### compute[[evaluate.ImageClassificationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/image_classification.py#L68)
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="nateraw/vit-base-beans",
>>> data=data,
>>> label_column="labels",
>>> metric="accuracy",
>>> label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
>>> strategy="bootstrap"
>>> )
```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `image-classification`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s [`bootstrap`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html) method.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"image"`) : The name of the column containing the images as PIL ImageFile in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
label_mapping (`Dict[str, Number]`, *optional*, defaults to `None`) : We want to map class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` of the `data` dataset.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
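The role of `label_mapping` can be illustrated in isolation. The sketch below is hypothetical and only shows the idea: pipeline predictions are strings, dataset labels are integers, and the mapping reconciles the two before the metric is computed.

```python
# Hypothetical illustration of label_mapping: the pipeline emits string labels,
# the dataset stores integer ids, and the mapping aligns them for the metric.
label_mapping = {"angular_leaf_spot": 0, "bean_rust": 1, "healthy": 2}
pipeline_outputs = [{"label": "healthy"}, {"label": "bean_rust"}]  # made-up outputs
predictions = [label_mapping[out["label"]] for out in pipeline_outputs]
```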
### QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]
#### evaluate.QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/question_answering.py#L75)
Question answering evaluator. This evaluator handles
[**extractive** question answering](https://huggingface.co/docs/transformers/task_summary#extractive-question-answering),
where the answer to the question is extracted from a context.
This question answering evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`question-answering`.
Methods in this class assume a data format compatible with the
`QuestionAnsweringPipeline`.
#### compute[[evaluate.QuestionAnsweringEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/question_answering.py#L144)
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
>>> data=data,
>>> metric="squad",
>>> )
```
> [!TIP]
> Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass `squad_v2_format=True` to
> the compute() call.
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
>>> data=data,
>>> metric="squad_v2",
>>> squad_v2_format=True,
>>> )
```
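The automatic inference of `squad_v2_format` hinges on spotting unanswerable examples. The helper below is a hypothetical sketch of that check, based on the `{"text": [], "answer_start": []}` convention for unanswerable questions; it is not the library's actual inference code.

```python
# Hypothetical sketch of squad_v2 format detection: an example is unanswerable
# when its "answers" field carries empty lists.
def looks_like_squad_v2(examples):
    return any(len(ex["answers"]["text"]) == 0 for ex in examples)

answerable = {"answers": {"text": ["Denver"], "answer_start": [12]}}     # made-up row
unanswerable = {"answers": {"text": [], "answer_start": []}}             # made-up row
```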
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `question-answering`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s [`bootstrap`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html) method.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
question_column (`str`, defaults to `"question"`) : The name of the column containing the question in the dataset specified by `data`.
context_column (`str`, defaults to `"context"`) : The name of the column containing the context in the dataset specified by `data`.
id_column (`str`, defaults to `"id"`) : The name of the column containing the identification field of the question and answer pair in the dataset specified by `data`.
label_column (`str`, defaults to `"answers"`) : The name of the column containing the answers in the dataset specified by `data`.
squad_v2_format (`bool`, *optional*, defaults to `None`) : Whether the dataset follows the format of the squad_v2 dataset. This is the case when the provided dataset has questions where the answer is not in the context, more specifically when the answers are given as `{"text": [], "answer_start": []}` in the answer column. If all questions have at least one answer, this parameter should be set to `False`. If this parameter is not provided, the format will be automatically inferred.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
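The `"bootstrap"` strategy delegates to `scipy.stats.bootstrap`, as noted in the `strategy` parameter above. The snippet below shows that call in isolation on made-up per-example scores, to give a feel for the confidence interval that ends up in the returned `Dict`; the exact way the evaluator wires scores into `bootstrap` is an assumption here.

```python
import numpy as np
from scipy.stats import bootstrap

# Made-up per-example correctness scores (1 = correct), for illustration only.
scores = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# The same scipy call the "bootstrap" strategy relies on: resample the scores
# and derive a confidence interval for their mean (here, the accuracy).
res = bootstrap(
    (scores,),
    np.mean,
    confidence_level=0.95,
    n_resamples=9999,
    random_state=0,
)
low, high = res.confidence_interval.low, res.confidence_interval.high
```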
### TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]
#### evaluate.TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text_classification.py#L51)
Text classification evaluator.
This text classification evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`text-classification` or with a `"sentiment-analysis"` alias.
Methods in this class assume a data format compatible with the [TextClassificationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TextClassificationPipeline) - a single textual
feature as input and a categorical label as output.
#### compute[[evaluate.TextClassificationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text_classification.py#L89)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`text-classification` or its alias `sentiment-analysis`). If the argument is of the type `str` or
is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the
argument specifies a pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) --
Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, *optional*, defaults to `"text"`) --
The name of the column containing the text feature in the dataset specified by `data`.
- **second_input_column** (`str`, *optional*, defaults to `None`) --
The name of the second column containing the text features. This may be useful for classification tasks
such as MNLI, where two columns are used.
- **label_column** (`str`, defaults to `"label"`) --
The name of the column containing the labels in the dataset specified by `data`.
- **label_mapping** (`Dict[str, Number]`, *optional*, defaults to `None`) --
Maps class labels defined by the model in the pipeline to values consistent with those
defined in the `label_column` of the `data` dataset.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `text-classification` or its alias `sentiment-analysis`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, *optional*, defaults to `"text"`) : The name of the column containing the text feature in the dataset specified by `data`.
second_input_column (`str`, *optional*, defaults to `None`) : The name of the second column containing the text features. This may be useful for classification tasks as MNLI, where two columns are used.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
label_mapping (`Dict[str, Number]`, *optional*, defaults to `None`) : Maps class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` of the `data` dataset.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
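The `"bootstrap"` strategy resamples the per-example scores with replacement and reports a confidence interval around the point estimate. As a rough illustration of what that means, here is a minimal percentile-bootstrap sketch in plain Python; the helper name `bootstrap_ci` is ours, and the evaluator itself delegates the real computation to `scipy.stats.bootstrap`:

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, confidence_level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1 - confidence_level) / 2
    low = means[int(alpha * n_resamples)]
    high = means[min(int((1 - alpha) * n_resamples), n_resamples - 1)]
    return statistics.fmean(scores), (low, high)

# 1 = correct prediction, 0 = incorrect; accuracy is the mean of these.
per_example = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
accuracy, (low, high) = bootstrap_ci(per_example)
print(accuracy)  # 0.7
print(low, high)
```

With small datasets (as in the two-example doctest above), the interval is extremely wide; bootstrap estimates only become informative with a reasonable number of evaluation examples.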
### TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]
#### evaluate.TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/token_classification.py#L84)
Token classification evaluator.
This token classification evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`token-classification`.
Methods in this class assume a data format compatible with the [TokenClassificationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TokenClassificationPipeline).
#### compute[[evaluate.TokenClassificationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/token_classification.py#L212)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`token-classification`). If the argument is of the type `str` or
is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the
argument specifies a pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) --
Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"tokens"`) --
The name of the column containing the tokens feature in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"ner_tags"`) --
The name of the column containing the labels in the dataset specified by `data`.
- **join_by** (`str`, *optional*, defaults to `" "`) --
This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join
words to generate a string input. This is especially useful for languages that do not separate words by a space.
Compute the metric for a given pipeline and dataset combination.
The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the [conll2003 dataset](https://huggingface.co/datasets/conll2003). Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )
```
> [!TIP]
> For example, the following dataset format is accepted by the evaluator:
>
> ```python
> dataset = Dataset.from_dict(
> mapping={
> "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
> "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
> },
> features=Features({
> "tokens": Sequence(feature=Value(dtype="string")),
> "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
> }),
> )
> ```
> [!WARNING]
> For example, the following dataset format is **not** accepted by the evaluator:
>
> ```python
> dataset = Dataset.from_dict(
> mapping={
> "tokens": [["New York is a city and Felix a person."]],
> "starts": [[0, 23]],
> "ends": [[7, 27]],
> "ner_tags": [["LOC", "PER"]],
> },
> features=Features({
> "tokens": Value(dtype="string"),
> "starts": Sequence(feature=Value(dtype="int32")),
> "ends": Sequence(feature=Value(dtype="int32")),
> "ner_tags": Sequence(feature=Value(dtype="string")),
> }),
> )
> ```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `token-classification`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"tokens"`) : The name of the column containing the tokens feature in the dataset specified by `data`.
label_column (`str`, defaults to `"ner_tags"`) : The name of the column containing the labels in the dataset specified by `data`.
join_by (`str`, *optional*, defaults to `" "`) : This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join words to generate a string input. This is especially useful for languages that do not separate words by a space.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
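Under the hood, the evaluator has to join the word list into a single string before feeding it to the pipeline, and then map the character spans of predicted entities back to word positions. A rough sketch of that joining step under the `join_by` convention (the helper name is ours, not the library's):

```python
def words_to_text_and_offsets(words, join_by=" "):
    """Join words into the pipeline input string, recording each word's start
    offset so entity character spans can be mapped back to word positions."""
    text = join_by.join(words)
    offsets, pos = [], 0
    for word in words:
        offsets.append(pos)
        pos += len(word) + len(join_by)
    return text, offsets

text, starts = words_to_text_and_offsets(["New", "York", "is", "a", "city"])
print(text)    # New York is a city
print(starts)  # [0, 4, 9, 12, 14]
```

Passing `join_by=""` handles languages that do not separate words with spaces, since the offsets then advance by word length alone.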
### TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]
#### evaluate.TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text_generation.py#L31)
Text generation evaluator.
This Text generation evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`text-generation`.
Methods in this class assume a data format compatible with the [TextGenerationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TextGenerationPipeline).
#### compute[[evaluate.TextGenerationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L218)
### Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]
#### evaluate.Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L88)
Text2Text generation evaluator.
This Text2Text generation evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`text2text-generation`.
Methods in this class assume a data format compatible with the `Text2TextGenerationPipeline`.
#### compute[[evaluate.Text2TextGenerationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L105)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`text2text-generation`). If the argument is of the type `str` or
is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the
argument specifies a pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) --
Specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"text"`) --
The name of the column containing the input text in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"label"`) --
The name of the column containing the labels in the dataset specified by `data`.
- **generation_kwargs** (`Dict`, *optional*, defaults to `None`) --
The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text2text-generation")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="facebook/bart-large-cnn",
...     data=data,
...     input_column="article",
...     label_column="highlights",
...     metric="rouge",
... )
```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `text2text-generation`). If the argument is of the type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in the evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"text"`) : The name of the column containing the input text in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
generation_kwargs (`Dict`, *optional*, defaults to `None`) : The generation kwargs are passed to the pipeline and set the text generation strategy.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
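The `generation_kwargs` are simply forwarded to the underlying pipeline call, where they control decoding (for example `max_new_tokens` or `num_beams`). A toy stand-in for the pipeline makes this plumbing visible; `fake_pipeline` below is a hypothetical placeholder, not the transformers API:

```python
def fake_pipeline(inputs, **generation_kwargs):
    # Stand-in for a Text2TextGenerationPipeline: a real pipeline would
    # forward these kwargs to model.generate() when producing each output.
    return [{"generated_text": text.upper()} for text in inputs], generation_kwargs

outputs, seen_kwargs = fake_pipeline(["short article"], max_new_tokens=64, num_beams=4)
print(outputs[0]["generated_text"])  # SHORT ARTICLE
print(seen_kwargs)                   # {'max_new_tokens': 64, 'num_beams': 4}
```

In practice you would pass `generation_kwargs={"max_new_tokens": 64, "num_beams": 4}` to `compute`, and the evaluator hands the dict to the pipeline unchanged.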
### SummarizationEvaluator[[evaluate.SummarizationEvaluator]]
#### evaluate.SummarizationEvaluator[[evaluate.SummarizationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L152)
Text summarization evaluator.
This text summarization evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`summarization`.
Methods in this class assume a data format compatible with the [SummarizationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.SummarizationPipeline).
#### compute[[evaluate.SummarizationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L166)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`summarization`). If the argument is of the type `str` or
is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the
argument specifies a pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to "simple") --
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"text"`) --
the name of the column containing the input text in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"label"`) --
the name of the column containing the labels in the dataset specified by `data`.
- **generation_kwargs** (`Dict`, *optional*, defaults to `None`) --
The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="facebook/bart-large-cnn",
>>> data=data,
>>> input_column="article",
>>> label_column="highlights",
>>> )
```
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `summarization`). If the argument is of type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to "simple") : specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"text"`) : the name of the column containing the input text in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : the name of the column containing the labels in the dataset specified by `data`.
generation_kwargs (`Dict`, *optional*, defaults to `None`) : The generation kwargs are passed to the pipeline and set the text generation strategy.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
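The `"bootstrap"` strategy above resamples the per-example scores and reports a confidence interval for each metric key. A minimal stdlib sketch of percentile bootstrapping can illustrate what that entails (the evaluator itself delegates to `scipy.stats.bootstrap`; the helper name and percentile method here are illustrative, not the library's implementation):

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=999, confidence_level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    # Resample the scores with replacement and collect the resampled means.
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence_level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[int((1.0 - alpha) * n_resamples) - 1]
    return {"score": statistics.mean(scores), "confidence_interval": (lo, hi)}

result = bootstrap_ci([0.8, 0.6, 0.9, 0.7, 0.85, 0.75])
```

The `confidence_level` and `n_resamples` arguments of `compute` are forwarded to the real bootstrap routine in exactly this role.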
### TranslationEvaluator[[evaluate.TranslationEvaluator]]
#### evaluate.TranslationEvaluator[[evaluate.TranslationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L211)
Translation evaluator.
This translation generation evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`translation`.
Methods in this class assume a data format compatible with the `TranslationPipeline`.
#### compute[[evaluate.TranslationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L225)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`translation`). If the argument is of type `str` or is a model instance, we use it to initialize
a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized
pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to "simple") --
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"text"`) --
the name of the column containing the input text in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"label"`) --
the name of the column containing the labels in the dataset specified by `data`.
- **generation_kwargs** (`Dict`, *optional*, defaults to `None`) --
The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
>>> model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
>>> data=data,
>>> )
```
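The `data.map` line in the example above flattens `wmt19`'s nested `translation` field into the `text`/`label` columns the evaluator expects. With plain dictionaries the transformation looks like this (a stdlib sketch; dataset rows have the same shape):

```python
def flatten_translation(example, src="de", tgt="fr"):
    """Mirror of the data.map lambda: pull source/target strings out of the nested field."""
    return {"text": example["translation"][src], "label": example["translation"][tgt]}

# A single wmt19 "fr-de" row has a nested dict keyed by language code.
row = {"translation": {"de": "Guten Morgen", "fr": "Bonjour"}}
flat = flatten_translation(row)
# flat == {"text": "Guten Morgen", "label": "Bonjour"}
```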
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `translation`). If the argument is of type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to "simple") : specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"text"`) : the name of the column containing the input text in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : the name of the column containing the labels in the dataset specified by `data`.
generation_kwargs (`Dict`, *optional*, defaults to `None`) : The generation kwargs are passed to the pipeline and set the text generation strategy.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
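When `split=None`, the evaluator picks a split via the `choose_split` helper mentioned in the parameter list. A hypothetical sketch of that kind of preference-ordered selection (the function body and the actual preference order used by `evaluate` are assumptions here):

```python
def choose_split(available, preference=("test", "validation", "train")):
    """Hypothetical sketch: return the first preferred split present in the dataset."""
    for name in preference:
        if name in available:
            return name
    raise ValueError(f"No usable split found among {list(available)}")

split = choose_split(["train", "validation"])
```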
### AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]
#### evaluate.AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/automatic_speech_recognition.py#L47)
Automatic speech recognition evaluator.
This automatic speech recognition evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`automatic-speech-recognition`.
Methods in this class assume a data format compatible with the `AutomaticSpeechRecognitionPipeline`.
#### compute[[evaluate.AutomaticSpeechRecognitionEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/automatic_speech_recognition.py#L63)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`automatic-speech-recognition`). If the argument is of type `str` or is a model instance, we use it
to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a
pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to "simple") --
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"path"`) --
the name of the column containing the input audio path in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"sentence"`) --
the name of the column containing the labels in the dataset specified by `data`.
- **generation_kwargs** (`Dict`, *optional*, defaults to `None`) --
The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
Examples:
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="openai/whisper-tiny.en",
>>> data=data,
>>> input_column="path",
>>> label_column="sentence",
>>> metric="wer",
>>> )
```
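`metric="wer"` in the example above loads the word error rate module. Conceptually, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words; a minimal stdlib sketch of that computation (the real metric comes from `evaluate.load("wer")`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over word tokens.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # d[j]+1: deletion, d[j-1]+1: insertion, prev+cost: substitution/match.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)
```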
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `automatic-speech-recognition`). If the argument is of type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to "simple") : specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"path"`) : the name of the column containing the input audio path in the dataset specified by `data`.
label_column (`str`, defaults to `"sentence"`) : the name of the column containing the labels in the dataset specified by `data`.
generation_kwargs (`Dict`, *optional*, defaults to `None`) : The generation kwargs are passed to the pipeline and set the text generation strategy.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
### AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]
#### evaluate.AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/audio_classification.py#L75)
Audio classification evaluator.
This audio classification evaluator can currently be loaded from [evaluator()](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.evaluator) using the default task name
`audio-classification`.
Methods in this class assume a data format compatible with the [transformers.AudioClassificationPipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.AudioClassificationPipeline).
#### compute[[evaluate.AudioClassificationEvaluator.compute]]
[Source](https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/audio_classification.py#L94)
- **model_or_pipeline** (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) --
If the argument is not specified, we initialize the default pipeline for the task (in this case
`audio-classification`). If the argument is of type `str` or is a model instance, we use it to
initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a
pre-initialized pipeline.
- **data** (`str` or `Dataset`, defaults to `None`) --
Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset
name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- **subset** (`str`, defaults to `None`) --
Defines which dataset subset to load. If `None` is passed the default subset is loaded.
- **split** (`str`, defaults to `None`) --
Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
- **metric** (`str` or `EvaluationModule`, defaults to `None`) --
Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and
load it. Otherwise we assume it represents a pre-loaded metric.
- **tokenizer** (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for
which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **feature_extractor** (`str` or `FeatureExtractionMixin`, *optional*, defaults to `None`) --
Argument can be used to overwrite a default feature extractor if `model_or_pipeline` represents a model
for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore
this argument.
- **strategy** (`Literal["simple", "bootstrap"]`, defaults to "simple") --
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- **confidence_level** (`float`, defaults to `0.95`) --
The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **n_resamples** (`int`, defaults to `9999`) --
The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
- **device** (`int`, defaults to `None`) --
Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive
integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and
CUDA:0 used if available, CPU otherwise.
- **random_state** (`int`, *optional*, defaults to `None`) --
The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for
debugging.
- **input_column** (`str`, defaults to `"file"`) --
The name of the column containing either the audio files or a raw waveform, represented as a numpy array, in the dataset specified by `data`.
- **label_column** (`str`, defaults to `"label"`) --
The name of the column containing the labels in the dataset specified by `data`.
- **label_mapping** (`Dict[str, Number]`, *optional*, defaults to `None`) --
We want to map class labels defined by the model in the pipeline to values consistent with those
defined in the `label_column` of the `data` dataset.
Compute the metric for a given pipeline and dataset combination.
Examples:
> [!TIP]
> Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> results = task_evaluator.compute(
>>> model_or_pipeline="superb/wav2vec2-base-superb-ks",
>>> data=data,
>>> label_column="label",
>>> input_column="file",
>>> metric="accuracy",
>>> label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )
```
> [!TIP]
> The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing
> the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.
```python
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> data = data.map(lambda example: {"audio": example["audio"]["array"]})
>>> results = task_evaluator.compute(
>>> model_or_pipeline="superb/wav2vec2-base-superb-ks",
>>> data=data,
>>> label_column="label",
>>> input_column="audio",
>>> metric="accuracy",
>>> label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
>>> )
```
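`label_mapping` bridges the string labels the pipeline emits and the integer class ids stored in the dataset's `label_column`, so predictions and references are comparable before the metric is computed. A stdlib sketch of that conversion (the exact label ids for the `superb` ks classes are an assumption here):

```python
def map_predictions(pipeline_labels, label_mapping):
    """Convert pipeline string predictions to dataset label ids before scoring."""
    return [label_mapping[lab] for lab in pipeline_labels]

# Hypothetical mapping: model label string -> dataset class id.
label_mapping = {"yes": 0, "no": 1, "up": 2, "down": 3}
preds = map_predictions(["yes", "down", "no"], label_mapping)
refs = [0, 3, 1]
accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
```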
**Parameters:**
model_or_pipeline (`str` or `Pipeline` or `Callable` or `PreTrainedModel` or `TFPreTrainedModel`, defaults to `None`) : If the argument is not specified, we initialize the default pipeline for the task (in this case `audio-classification`). If the argument is of type `str` or is a model instance, we use it to initialize a new `Pipeline` with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (`str` or `Dataset`, defaults to `None`) : Specifies the dataset we will run evaluation on. If it is of type `str`, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (`str`, defaults to `None`) : Defines which dataset subset to load. If `None` is passed the default subset is loaded.
split (`str`, defaults to `None`) : Defines which dataset split to load. If `None` is passed, infers based on the `choose_split` function.
metric (`str` or `EvaluationModule`, defaults to `None`) : Specifies the metric we use in evaluator. If it is of type `str`, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (`str` or `PreTrainedTokenizer`, *optional*, defaults to `None`) : Argument can be used to overwrite a default tokenizer if `model_or_pipeline` represents a model for which we build a pipeline. If `model_or_pipeline` is `None` or a pre-initialized pipeline, we ignore this argument.
strategy (`Literal["simple", "bootstrap"]`, defaults to `"simple"`) : Specifies the evaluation strategy. Possible values are: - `"simple"` - we evaluate the metric and return the scores. - `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using `scipy`'s `bootstrap` method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (`float`, defaults to `0.95`) : The `confidence_level` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
n_resamples (`int`, defaults to `9999`) : The `n_resamples` value passed to `bootstrap` if `"bootstrap"` strategy is chosen.
device (`int`, defaults to `None`) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If `None` is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (`int`, *optional*, defaults to `None`) : The `random_state` value passed to `bootstrap` if `"bootstrap"` strategy is chosen. Useful for debugging.
input_column (`str`, defaults to `"file"`) : The name of the column containing either the audio files or a raw waveform, represented as a numpy array, in the dataset specified by `data`.
label_column (`str`, defaults to `"label"`) : The name of the column containing the labels in the dataset specified by `data`.
label_mapping (`Dict[str, Number]`, *optional*, defaults to `None`) : Used to map the class labels defined by the model in the pipeline to values consistent with those defined in the `label_column` of the `data` dataset.
**Returns:**
A `Dict`. The keys represent metric keys calculated for the `metric` specified in function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
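Under the `"bootstrap"` strategy, the confidence interval and standard error reported for each metric key come from `scipy.stats.bootstrap` applied to per-example scores. A minimal sketch of that computation, using an illustrative hand-made list of per-example accuracy indicators (not real evaluator output):

```python
import numpy as np
from scipy.stats import bootstrap

# Illustrative per-example correctness indicators (1 = prediction matched the label).
per_example = np.array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1])

# Point estimate, as returned under the "simple" strategy.
score = per_example.mean()

# Resampled confidence interval and standard error, as added under "bootstrap".
res = bootstrap(
    (per_example,),
    np.mean,
    confidence_level=0.95,
    n_resamples=9999,
    random_state=0,
)

print(score)
print(res.confidence_interval)
print(res.standard_error)
```

The evaluator assembles these pieces into the returned `Dict`, so each metric key maps to the score, the confidence interval bounds, and the standard error.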
