Evaluator
The evaluator classes for automatic evaluation.
Evaluator classes[[evaluate.evaluator]]
The main entry point for using the evaluator:
evaluate.evaluator[[evaluate.evaluator]]
Utility factory method to build an Evaluator.
Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers
to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.
Examples:
>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")
Parameters:
task (str) : The task defining which evaluator will be returned. Currently accepted tasks are:
- "image-classification": will return an ImageClassificationEvaluator.
- "question-answering": will return a QuestionAnsweringEvaluator.
- "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
- "token-classification": will return a TokenClassificationEvaluator.
Returns:
[Evaluator](/docs/evaluate/main/en/package_reference/evaluator_classes#evaluate.Evaluator)
An evaluator suitable for the task.
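The task-to-class dispatch above can be sketched in plain Python. This is a hypothetical illustration of how a factory like evaluator() could resolve task names and aliases; the names TASK_REGISTRY, TASK_ALIASES and make_evaluator are illustrative, not the library's actual internals.

```python
# Hypothetical sketch of a task-name-to-evaluator dispatch. The real mapping
# lives inside `evaluate.evaluator`; the names below are illustrative only.
TASK_REGISTRY = {
    "image-classification": "ImageClassificationEvaluator",
    "question-answering": "QuestionAnsweringEvaluator",
    "text-classification": "TextClassificationEvaluator",
    "token-classification": "TokenClassificationEvaluator",
}
TASK_ALIASES = {"sentiment-analysis": "text-classification"}

def make_evaluator(task: str) -> str:
    # Resolve aliases first, then look the task up in the registry.
    task = TASK_ALIASES.get(task, task)
    if task not in TASK_REGISTRY:
        raise KeyError(f"Unknown task: {task!r}")
    return TASK_REGISTRY[task]
```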
The base class for all evaluator classes:
evaluate.Evaluator[[evaluate.Evaluator]]
The Evaluator class is the base class from which all evaluators inherit; it implements the operations shared across the different evaluators. Refer to this class for methods common to all of them.
check_required_columns[[evaluate.Evaluator.check_required_columns]]
Ensure the columns required for the evaluation are present in the dataset.
Example:
>>> from datasets import load_dataset
>>> from evaluate import evaluator
>>> data = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").check_required_columns(data, {"input_column": "text", "label_column": "label"})
Parameters:
data (str or Dataset) : Specifies the dataset we will run evaluation on.
columns_names (Dict[str, str]) : Maps the arguments of the evaluate.EvaluationModule.compute() method (keys) to the column names to check for in the dataset (values).
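The check described above can be sketched in plain Python. This is a minimal illustration of the validation logic, not the library's actual implementation; the function name and error message are assumptions.

```python
def check_required_columns(column_names, columns_map):
    """Sketch of the column check: `column_names` are the dataset's columns,
    `columns_map` maps compute() argument names to expected column names.
    Raises ValueError when an expected column is missing."""
    for arg_name, column in columns_map.items():
        if column not in column_names:
            raise ValueError(
                f"Column {column!r} (for argument {arg_name!r}) "
                f"not found in dataset columns {column_names}"
            )
```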
compute_metric[[evaluate.Evaluator.compute_metric]]
Compute and return metrics.
get_dataset_split[[evaluate.Evaluator.get_dataset_split]]
Infers which split to use if None is given.
Example:
>>> from evaluate import evaluator
>>> evaluator("text-classification").get_dataset_split(data="rotten_tomatoes")
WARNING:evaluate.evaluator.base:Dataset split not defined! Automatically evaluating with split: TEST
'test'
Parameters:
data (str) : Name of dataset.
subset (str) : Name of config for datasets with multiple configurations (e.g. 'glue/cola').
split (str, defaults to None) : Split to use.
Returns:
split
str containing which split to use
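The split inference can be sketched as a simple preference order. The order shown here (test, then validation, then train) matches the warning in the example above, but it is an assumption about choose_split's heuristic, not a statement of the library's exact logic.

```python
def choose_split(available_splits):
    """Sketch of split inference: pick the most evaluation-appropriate split
    that the dataset actually provides. The preference order is an assumption."""
    for preferred in ("test", "validation", "train"):
        if preferred in available_splits:
            return preferred
    raise ValueError(f"No suitable split found in {available_splits}")
```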
load_data[[evaluate.Evaluator.load_data]]
Load dataset with given subset and split.
Example:
>>> from evaluate import evaluator
>>> evaluator("text-classification").load_data(data="rotten_tomatoes", split="train")
Dataset({
features: ['text', 'label'],
num_rows: 8530
})
Parameters:
data (Dataset or str, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Specifies dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).
split (str, defaults to None) : User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is a str type, will automatically select the best one via choose_split().
Returns:
data (Dataset)
Loaded dataset which will be used for evaluation.
predictions_processor[[evaluate.Evaluator.predictions_processor]]
A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
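To make the role of this hook concrete, here is a sketch of what a text-classification predictions processor does: the pipeline emits one {"label": ..., "score": ...} dict per example, while the metric expects integer predictions. The function below is an illustration of that reshaping, not the library's actual implementation.

```python
def predictions_processor(predictions, label_mapping):
    """Sketch: map each pipeline output's label through `label_mapping`
    (falling back to the raw label when no mapping is given), and wrap the
    result in the dict shape the metric's compute() expects."""
    mapped = [
        label_mapping[p["label"]] if label_mapping is not None else p["label"]
        for p in predictions
    ]
    return {"predictions": mapped}
```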
prepare_data[[evaluate.Evaluator.prepare_data]]
Prepare data.
Example:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> evaluator("text-classification").prepare_data(ds, input_column="text", second_input_column=None, label_column="label")
Parameters:
data (Dataset) : Specifies the dataset we will run evaluation on.
input_column (str, defaults to "text") : The name of the column containing the text feature in the dataset specified by data.
second_input_column (str, optional) : The name of the column containing the second text feature if there is one. Otherwise, set to None.
label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.
Returns:
dict: metric inputs.
list: pipeline inputs.
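The two return values can be illustrated with plain Python, using a list of dicts as a stand-in for a Dataset: references go to the metric, raw inputs go to the pipeline. This is a sketch of the split, not the library's implementation.

```python
def prepare_data(rows, input_column, label_column):
    """Sketch of prepare_data: build the metric inputs (references) and the
    pipeline inputs (texts) from the requested columns. `rows` stands in
    for a datasets.Dataset."""
    metric_inputs = {"references": [r[label_column] for r in rows]}
    pipe_inputs = [r[input_column] for r in rows]
    return metric_inputs, pipe_inputs
```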
prepare_metric[[evaluate.Evaluator.prepare_metric]]
Prepare metric.
Example:
>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_metric("accuracy")
Parameters:
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
Returns:
The loaded metric.
prepare_pipeline[[evaluate.Evaluator.prepare_pipeline]]
Prepare pipeline.
Example:
>>> from evaluate import evaluator
>>> evaluator("text-classification").prepare_pipeline(model_or_pipeline="distilbert-base-uncased")
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None) : Argument can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
Returns:
The initialized pipeline.
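The three-way dispatch described in the model_or_pipeline parameter can be sketched as follows. This is a simplified illustration (the real logic also distinguishes model instances, which are themselves callable); the function and argument names are assumptions.

```python
def resolve_pipeline(model_or_pipeline, default_pipeline, build_pipeline):
    """Sketch of the prepare_pipeline dispatch:
    - None            -> initialize the task's default pipeline;
    - str             -> treat as a model name and build a pipeline around it;
    - other callables -> assume a pre-initialized pipeline and use it as-is.
    (The real logic also routes model instances through build_pipeline.)"""
    if model_or_pipeline is None:
        return default_pipeline()
    if isinstance(model_or_pipeline, str):
        return build_pipeline(model_or_pipeline)
    return model_or_pipeline
```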
The task-specific evaluators
ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]
evaluate.ImageClassificationEvaluator[[evaluate.ImageClassificationEvaluator]]
Image classification evaluator.
This image classification evaluator can currently be loaded from evaluator() using the default task name
image-classification.
Methods in this class assume a data format compatible with the ImageClassificationPipeline.
compute[[evaluate.ImageClassificationEvaluator.compute]]
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "image") : The name of the column containing the images as PIL ImageFile in the dataset specified by data.
label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.
label_mapping (Dict[str, Number], optional, defaults to None) : Maps the class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.
Returns:
A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
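The two result shapes can be illustrated with plain Python. The exact keys inside the "bootstrap" entry are assumptions based on the description above (score, confidence interval, standard error), so treat this as a sketch of the shape rather than the library's exact output.

```python
# Illustrative shapes of compute() results for the two strategies.
simple_result = {"accuracy": 0.975}

bootstrap_result = {
    "accuracy": {
        "score": 0.975,
        "confidence_interval": (0.94, 0.99),
        "standard_error": 0.012,
    }
}

def point_estimate(result, key):
    """Read the metric score regardless of which strategy produced it."""
    value = result[key]
    return value["score"] if isinstance(value, dict) else value
```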
QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]
evaluate.QuestionAnsweringEvaluator[[evaluate.QuestionAnsweringEvaluator]]
Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.
This question answering evaluator can currently be loaded from evaluator() using the default task name
question-answering.
Methods in this class assume a data format compatible with the
QuestionAnsweringPipeline.
compute[[evaluate.QuestionAnsweringEvaluator.compute]]
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )
Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
question_column (str, defaults to "question") : The name of the column containing the question in the dataset specified by data.
context_column (str, defaults to "context") : The name of the column containing the context in the dataset specified by data.
id_column (str, defaults to "id") : The name of the column containing the identification field of the question and answer pair in the dataset specified by data.
label_column (str, defaults to "answers") : The name of the column containing the answers in the dataset specified by data.
squad_v2_format (bool, optional, defaults to None) : Whether the dataset follows the format of the squad_v2 dataset. This is the case when the provided dataset has questions where the answer is not in the context, more specifically when the answers are given as {"text": [], "answer_start": []} in the answer column. If all questions have at least one answer, this parameter should be set to False. If this parameter is not provided, the format will be automatically inferred.
Returns:
A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the "simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict containing the score, the confidence interval and the standard error calculated for each metric key.
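The automatic inference of squad_v2_format described above can be sketched in plain Python: a dataset is treated as squad_v2-style when at least one question has an empty answer list. The function name is an assumption; only the criterion comes from the description above.

```python
def infer_squad_v2_format(answers_column):
    """Sketch: return True when any example's answers dict has an empty
    "text" list, i.e. the answer is not present in the context."""
    return any(len(ans["text"]) == 0 for ans in answers_column)
```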
TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]
evaluate.TextClassificationEvaluator[[evaluate.TextClassificationEvaluator]]
Text classification evaluator.
This text classification evaluator can currently be loaded from evaluator() using the default task name
text-classification or with a "sentiment-analysis" alias.
Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual
feature as input and a categorical label as output.
compute[[evaluate.TextClassificationEvaluator.compute]]
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias, sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, optional, defaults to "text") : The name of the column containing the text feature in the dataset specified by data.
second_input_column (str, optional, defaults to None) : The name of the second column containing the text features. This may be useful for classification tasks like MNLI, where two input columns are used.
label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.
label_mapping (Dict[str, Number], optional, defaults to None) : Used to map class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.
Returns:
A Dict. The keys are the metric keys calculated for the metric specified in the function arguments. For the
"simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict
containing the score, the confidence interval and the standard error calculated for each metric key.
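The "bootstrap" strategy above delegates to scipy.stats.bootstrap, but the underlying idea is simple resampling. Below is a minimal pure-Python sketch of it for an accuracy metric; the helper name `bootstrap_accuracy` and the percentile-based interval are illustrative only, not the evaluator's internal implementation:

```python
import random
import statistics

def bootstrap_accuracy(predictions, references, n_resamples=1000,
                       confidence_level=0.95, seed=0):
    """Illustrative bootstrap: resample (prediction, reference) pairs with
    replacement and collect the accuracy of each resample."""
    rng = random.Random(seed)
    pairs = list(zip(predictions, references))
    scores = []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in pairs]
        scores.append(sum(p == r for p, r in sample) / len(sample))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    lower = scores[int(alpha * n_resamples)]
    upper = scores[int((1.0 - alpha) * n_resamples) - 1]
    return {
        "score": sum(p == r for p, r in pairs) / len(pairs),
        "confidence_interval": (lower, upper),
        "standard_error": statistics.stdev(scores),
    }
```

The returned mapping mirrors the nested Dict described under Returns: a score, a confidence interval, and a standard error per metric key.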
TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]
evaluate.TokenClassificationEvaluator[[evaluate.TokenClassificationEvaluator]]
Token classification evaluator.
This token classification evaluator can currently be loaded from evaluator() using the default task name
token-classification.
Methods in this class assume a data format compatible with the TokenClassificationPipeline.
compute[[evaluate.TokenClassificationEvaluator.compute]]
compute(model_or_pipeline=None, data=None, subset=None, split=None, metric=None, tokenizer=None, strategy='simple', confidence_level=0.95, n_resamples=9999, device=None, random_state=None, input_column='tokens', label_column='ner_tags', join_by=' ')
Source: https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/token_classification.py#L212
Compute the metric for a given pipeline and dataset combination.
The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset format. Datasets whose inputs are single strings and whose labels are lists of offsets are not supported.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )
For example, the following dataset format is accepted by the evaluator:
dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)
For example, the following dataset format is not accepted by the evaluator:
dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New York is a city and Felix a person."]],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "tokens") : The name of the column containing the tokens feature in the dataset specified by data.
label_column (str, defaults to "ner_tags") : The name of the column containing the labels in the dataset specified by data.
join_by (str, optional, defaults to " ") : This evaluator supports dataset whose input column is a list of words. This parameter specifies how to join words to generate a string input. This is especially useful for languages that do not separate words by a space.
Returns:
A Dict. The keys are the metric keys calculated for the metric specified in the function arguments. For the
"simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict
containing the score, the confidence interval and the standard error calculated for each metric key.
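As described above, this evaluator joins each example's word list in input_column with join_by before passing it to the pipeline. A rough sketch of that preprocessing step, assuming a batch in the same format as the accepted dataset example; the function name `prepare_pipeline_inputs` is illustrative, not the evaluator's internal API:

```python
def prepare_pipeline_inputs(examples, input_column="tokens", join_by=" "):
    """Join each example's word list into a single string, as the
    token-classification evaluator does before calling the pipeline.
    For languages written without spaces, join_by="" keeps words contiguous."""
    return [join_by.join(words) for words in examples[input_column]]

batch = {"tokens": [["New", "York", "is", "a", "city"]]}
print(prepare_pipeline_inputs(batch))  # ['New York is a city']
```

With join_by="" the same helper produces unspaced strings, which is why the parameter matters for languages that do not separate words by a space.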
TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]
evaluate.TextGenerationEvaluator[[evaluate.TextGenerationEvaluator]]
Text generation evaluator.
This text generation evaluator can currently be loaded from evaluator() using the default task name
text-generation.
Methods in this class assume a data format compatible with the TextGenerationPipeline.
compute[[evaluate.TextGenerationEvaluator.compute]]
compute(model_or_pipeline=None, data=None, subset=None, split=None, metric=None, tokenizer=None, feature_extractor=None, strategy='simple', confidence_level=0.95, n_resamples=9999, device=None, random_state=None, input_column='text', label_column='label', label_mapping=None)
Source: https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/base.py#L218
Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]
evaluate.Text2TextGenerationEvaluator[[evaluate.Text2TextGenerationEvaluator]]
Text2Text generation evaluator.
This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name
text2text-generation.
Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.
compute[[evaluate.Text2TextGenerationEvaluator.compute]]
compute(model_or_pipeline=None, data=None, subset=None, split=None, metric=None, tokenizer=None, strategy='simple', confidence_level=0.95, n_resamples=9999, device=None, random_state=None, input_column='text', label_column='label', generation_kwargs=None)
Source: https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L105
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text2text-generation")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="facebook/bart-large-cnn",
...     data=data,
...     input_column="article",
...     label_column="highlights",
...     metric="rouge",
... )
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "text") : The name of the column containing the input text in the dataset specified by data.
label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.
generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.
Returns:
A Dict. The keys are the metric keys calculated for the metric specified in the function arguments. For the
"simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict
containing the score, the confidence interval and the standard error calculated for each metric key.
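In the example above, metric="rouge" scores n-gram overlap between the generated text and the reference in label_column. A simplified, hand-rolled unigram-overlap recall (ROUGE-1-like) shows what is being measured; the real rouge module additionally applies stemming and reports several variants, so treat this as a sketch of the idea only:

```python
from collections import Counter

def rouge1_recall(prediction: str, reference: str) -> float:
    """Fraction of reference unigrams also present in the prediction
    (with clipped counts), a simplified form of ROUGE-1 recall."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, pred_counts[token])
                  for token, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)
```

A prediction identical to the reference scores 1.0; a prediction sharing no tokens with it scores 0.0.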
SummarizationEvaluator[[evaluate.SummarizationEvaluator]]
evaluate.SummarizationEvaluator[[evaluate.SummarizationEvaluator]]
Text summarization evaluator.
This text summarization evaluator can currently be loaded from evaluator() using the default task name
summarization.
Methods in this class assume a data format compatible with the SummarizationPipeline.
compute[[evaluate.SummarizationEvaluator.compute]]
compute(model_or_pipeline=None, data=None, subset=None, split=None, metric=None, tokenizer=None, strategy='simple', confidence_level=0.95, n_resamples=9999, device=None, random_state=None, input_column='text', label_column='label', generation_kwargs=None)
Source: https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L166
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="facebook/bart-large-cnn",
...     data=data,
...     input_column="article",
...     label_column="highlights",
... )
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "text") : The name of the column containing the input text in the dataset specified by data.
label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.
generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.
Returns:
A Dict. The keys are the metric keys calculated for the metric specified in the function arguments. For the
"simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict
containing the score, the confidence interval and the standard error calculated for each metric key.
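The generation_kwargs dict described above is forwarded to the underlying pipeline call, where it controls decoding (for example transformers options such as max_new_tokens or num_beams). A toy stand-in showing just that forwarding behaviour; the stub pipeline and its word-level truncation are invented purely for illustration and do not reflect real transformers decoding:

```python
def summarize_stub(text: str, **generation_kwargs) -> str:
    """Toy 'pipeline': echoes the input, truncated to max_new_tokens words.
    Real decoding options depend on the transformers pipeline in use."""
    words = text.split()
    limit = generation_kwargs.get("max_new_tokens")
    return " ".join(words if limit is None else words[:limit])

def evaluate_stub(texts, generation_kwargs=None):
    # The evaluator unpacks generation_kwargs into every pipeline call.
    generation_kwargs = generation_kwargs or {}
    return [summarize_stub(t, **generation_kwargs) for t in texts]
```

Passing generation_kwargs={"max_new_tokens": 2} would, in this sketch, cap every output at two words; omitting it leaves the pipeline's defaults in effect.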
TranslationEvaluator[[evaluate.TranslationEvaluator]]
evaluate.TranslationEvaluator[[evaluate.TranslationEvaluator]]
Translation evaluator.
This translation evaluator can currently be loaded from evaluator() using the default task name
translation.
Methods in this class assume a data format compatible with the TranslationPipeline.
compute[[evaluate.TranslationEvaluator.compute]]
Source: https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/text2text_generation.py#L225
compute(model_or_pipeline=None, data=None, subset=None, split=None, metric=None, tokenizer=None, strategy='simple', confidence_level=0.95, n_resamples=9999, device=None, random_state=None, input_column='text', label_column='label', generation_kwargs=None)
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> data = load_dataset("wmt19", "fr-de", split="validation[:40]")
>>> data = data.map(lambda x: {"text": x["translation"]["de"], "label": x["translation"]["fr"]})
>>> results = task_evaluator.compute(
...     model_or_pipeline="Helsinki-NLP/opus-mt-de-fr",
...     data=data,
... )
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "text") : The name of the column containing the input text in the dataset specified by data.
label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.
generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.
Returns:
A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the
"simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict
containing the score, the confidence interval and the standard error calculated for each metric key.
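The two return shapes can be sketched in plain Python (the "bleu" key and all values below are made-up illustrations, not real library output):

```python
# Hand-written illustration of the two result shapes described above.

simple_result = {"bleu": 35.1}  # "simple": metric key -> score

bootstrap_result = {            # "bootstrap": metric key -> dict
    "bleu": {
        "score": 35.1,
        "confidence_interval": (31.0, 39.2),
        "standard_error": 2.1,
    }
}

def point_score(result, key):
    """Return the point score for `key` under either strategy."""
    value = result[key]
    return value["score"] if isinstance(value, dict) else value

assert point_score(simple_result, "bleu") == point_score(bootstrap_result, "bleu")
```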
AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]
evaluate.AutomaticSpeechRecognitionEvaluator[[evaluate.AutomaticSpeechRecognitionEvaluator]]
Automatic speech recognition evaluator.
This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name
automatic-speech-recognition.
Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.
compute[[evaluate.AutomaticSpeechRecognitionEvaluator.compute]]
Source: https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/automatic_speech_recognition.py#L63
compute(model_or_pipeline=None, data=None, subset=None, split=None, metric=None, tokenizer=None, strategy='simple', confidence_level=0.95, n_resamples=9999, device=None, random_state=None, input_column='path', label_column='sentence', generation_kwargs=None)
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="openai/whisper-tiny.en",
...     data=data,
...     input_column="path",
...     label_column="sentence",
...     metric="wer",
... )
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "path") : The name of the column containing the input audio path in the dataset specified by data.
label_column (str, defaults to "sentence") : The name of the column containing the labels in the dataset specified by data.
generation_kwargs (Dict, optional, defaults to None) : The generation kwargs are passed to the pipeline and set the text generation strategy.
Returns:
A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the
"simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict
containing the score, the confidence interval and the standard error calculated for each metric key.
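The "wer" metric in the example above is word error rate: the word-level edit distance between prediction and reference divided by the number of reference words. A minimal self-contained sketch of that definition (not the evaluate implementation):

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), prediction.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word in a four-word reference -> WER of 0.25
assert word_error_rate("the cat sat down", "the cat sat up") == 0.25
```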
AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]
evaluate.AudioClassificationEvaluator[[evaluate.AudioClassificationEvaluator]]
Audio classification evaluator.
This audio classification evaluator can currently be loaded from evaluator() using the default task name
audio-classification.
Methods in this class assume a data format compatible with the transformers.AudioClassificationPipeline.
compute[[evaluate.AudioClassificationEvaluator.compute]]
Source: https://github.com/huggingface/evaluate/blob/main/src/evaluate/evaluator/audio_classification.py#L94
compute(model_or_pipeline=None, data=None, subset=None, split=None, metric=None, tokenizer=None, feature_extractor=None, strategy='simple', confidence_level=0.95, n_resamples=9999, device=None, random_state=None, input_column='file', label_column='label', label_mapping=None)
Compute the metric for a given pipeline and dataset combination.
Examples:
Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="superb/wav2vec2-base-superb-ks",
...     data=data,
...     label_column="label",
...     input_column="file",
...     metric="accuracy",
...     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
... )
The evaluator also supports raw audio data in the form of a numpy array. Be aware, however, that accessing the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("audio-classification")
>>> data = load_dataset("superb", 'ks', split="test[:40]")
>>> data = data.map(lambda example: {"audio": example["audio"]["array"]})
>>> results = task_evaluator.compute(
...     model_or_pipeline="superb/wav2vec2-base-superb-ks",
...     data=data,
...     label_column="label",
...     input_column="audio",
...     metric="accuracy",
...     label_mapping={"yes": 0, "no": 1, "up": 2, "down": 3}
... )
Parameters:
model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) : If the argument is not specified, we initialize the default pipeline for the task (in this case audio-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) : Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
subset (str, defaults to None) : Defines which dataset subset to load. If None is passed the default subset is loaded.
split (str, defaults to None) : Defines which dataset split to load. If None is passed, infers based on the choose_split function.
metric (str or EvaluationModule, defaults to None) : Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) : Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
feature_extractor (str or FeatureExtractionMixin, optional, defaults to None) : Argument can be used to overwrite a default feature extractor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") : Specifies the evaluation strategy. Possible values are: - "simple" - we evaluate the metric and return the scores. - "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
confidence_level (float, defaults to 0.95) : The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) : The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) : Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) : The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "file") : The name of the column containing either the audio files or a raw waveform, represented as a numpy array, in the dataset specified by data.
label_column (str, defaults to "label") : The name of the column containing the labels in the dataset specified by data.
label_mapping (Dict[str, Number], optional, defaults to None) : Maps the class labels predicted by the model's pipeline to values consistent with those defined in the label_column of the data dataset.
Returns:
A Dict. The keys represent metric keys calculated for the metric specified in function arguments. For the
"simple" strategy, the value is the metric score. For the "bootstrap" strategy, the value is a Dict
containing the score, the confidence interval and the standard error calculated for each metric key.
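The role of label_mapping can be sketched in plain Python: the pipeline predicts string labels while the dataset stores integer class ids, and the mapping reconciles the two before the metric is computed (all names and ids below are illustrative, not real model output):

```python
# Hypothetical pipeline predictions (string labels) and dataset labels (ints).
predictions = ["yes", "no", "up", "yes"]
references = [0, 1, 2, 1]  # the dataset's integer class ids

# label_mapping translates the pipeline's string labels into the
# dataset's integer ids so predictions and references are comparable.
label_mapping = {"yes": 0, "no": 1, "up": 2, "down": 3}

mapped = [label_mapping[p] for p in predictions]
accuracy = sum(m == r for m, r in zip(mapped, references)) / len(references)
```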