| <meta charset="utf-8" /><meta http-equiv="content-security-policy" content=""><meta name="hf:doc:metadata" content="{"local":"using-the-evaluator","sections":[{"local":"text-classification","sections":[{"local":"evaluate-models-on-the-hub","title":"Evaluate models on the Hub"},{"local":"evaluate-multiple-metrics","title":"Evaluate multiple metrics"}],"title":"Text classification"},{"local":"token-classification","sections":[{"local":"benchmarking-several-models","title":"Benchmarking several models"}],"title":"Token Classification"},{"local":"question-answering","sections":[{"local":"confidence-intervals","title":"Confidence intervals"}],"title":"Question Answering"},{"local":"image-classification","sections":[{"local":"handling-large-datasets","title":"Handling large datasets"}],"title":"Image classification"}],"title":"Using the `evaluator`"}" data-svelte="svelte-1phssyn"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/assets/pages/__layout.svelte-hf-doc-builder.css"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/start-hf-doc-builder.js"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/chunks/vendor-hf-doc-builder.js"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/chunks/paths-hf-doc-builder.js"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/pages/__layout.svelte-hf-doc-builder.js"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/pages/base_evaluator.mdx-hf-doc-builder.js"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/chunks/Tip-hf-doc-builder.js"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/chunks/IconCopyLink-hf-doc-builder.js"> | |
| <link rel="modulepreload" href="/docs/evaluate/v0.2.2/en/_app/chunks/CodeBlock-hf-doc-builder.js"> | |
| <h1 class="relative group"><a id="using-the-evaluator" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#using-the-evaluator"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Using the <code>evaluator</code></span></h1> | |
| <p>The <code>Evaluator</code> classes let you evaluate a triplet of model, dataset, and metric. The model is wrapped in a pipeline, which is responsible for handling all preprocessing and post-processing. Out of the box, <code>Evaluator</code>s support transformers pipelines for the supported tasks, but custom pipelines can also be passed, as showcased in the section <a href="custom_evaluator">Using the <code>evaluator</code> with custom pipelines</a>.</p> | |
| <p>Currently supported tasks are:</p> | |
| <ul><li><code>"text-classification"</code>: will use the <a href="/docs/evaluate/v0.2.2/en/package_reference/evaluator_classes#evaluate.TextClassificationEvaluator">TextClassificationEvaluator</a>.</li> | |
| <li><code>"token-classification"</code>: will use the <a href="/docs/evaluate/v0.2.2/en/package_reference/evaluator_classes#evaluate.TokenClassificationEvaluator">TokenClassificationEvaluator</a>.</li> | |
| <li><code>"question-answering"</code>: will use the <a href="/docs/evaluate/v0.2.2/en/package_reference/evaluator_classes#evaluate.QuestionAnsweringEvaluator">QuestionAnsweringEvaluator</a>.</li> | |
| <li><code>"image-classification"</code>: will use the <a href="/docs/evaluate/v0.2.2/en/package_reference/evaluator_classes#evaluate.ImageClassificationEvaluator">ImageClassificationEvaluator</a>.</li></ul> | |
| <p>Each task has its own set of requirements for the dataset format and pipeline output, so make sure to check them for your custom use case. Let’s have a look at each of them and see how you can use the evaluator to evaluate one or several models, datasets, and metrics at the same time.</p> | |
| <h2 class="relative group"><a id="text-classification" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#text-classification"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Text classification | |
| </span></h2> | |
| <p>The text classification evaluator can be used to evaluate text models on classification datasets such as IMDb. Besides the model, data, and metric inputs, it takes the following optional inputs:</p> | |
| <ul><li><code>input_column="text"</code>: with this argument the column with the data for the pipeline can be specified.</li> | |
| <li><code>label_column="label"</code>: with this argument the column with the labels for the evaluation can be specified.</li> | |
| <li><code>label_mapping=None</code>: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. E.g. the labels in <code>label_column</code> can be integers (<code>0</code>/<code>1</code>) whereas the pipeline can produce label names such as <code>"positive"</code>/<code>"negative"</code>. With that dictionary the pipeline outputs are mapped to the labels.</li></ul> | |
| <p>By default the <code>"accuracy"</code> metric is computed.</p> | |
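| <p>As a rough illustration of what <code>label_mapping</code> does, the sketch below maps pipeline-style outputs to integer labels in plain Python. The <code>pipeline_outputs</code> list and the <code>map_labels</code> helper are hypothetical and only mimic the idea; they are not the evaluator’s internal code.</p> | |

```python
# Hypothetical pipeline outputs: one dict per input text (illustrative values).
pipeline_outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.67},
]

# Same shape as the `label_mapping` argument: label name -> integer label.
label_mapping = {"NEGATIVE": 0, "POSITIVE": 1}


def map_labels(outputs, mapping):
    """Translate label names into the integer ids used in the label column."""
    return [mapping[output["label"]] for output in outputs]


predictions = map_labels(pipeline_outputs, label_mapping)
print(predictions)  # [1, 0, 1]
```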
| <h3 class="relative group"><a id="evaluate-models-on-the-hub" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#evaluate-models-on-the-hub"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Evaluate models on the Hub | |
| </span></h3> | |
| <p>There are several ways to pass a model to the evaluator: you can pass the name of a model on the Hub, you can load a <code>transformers</code> model and pass it to the evaluator or you can pass an initialized <code>transformers.Pipeline</code>. Alternatively you can pass any callable function that behaves like a <code>pipeline</code> call for the task in any framework.</p> | |
| <p>So any of the following works:</p> | |
| <div class="code-block relative"> | |
| <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-keyword">from</span> evaluate <span class="hljs-keyword">import</span> evaluator | |
| <span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForSequenceClassification, pipeline | |
| data = load_dataset(<span class="hljs-string">"imdb"</span>, split=<span class="hljs-string">"test"</span>).shuffle(seed=<span class="hljs-number">42</span>).select(<span class="hljs-built_in">range</span>(<span class="hljs-number">1000</span>)) | |
| task_evaluator = evaluator(<span class="hljs-string">"text-classification"</span>) | |
| <span class="hljs-comment"># 1. Pass a model name or path</span> | |
| eval_results = task_evaluator.compute( | |
| model_or_pipeline=<span class="hljs-string">"lvwerra/distilbert-imdb"</span>, | |
| data=data, | |
| label_mapping={<span class="hljs-string">"NEGATIVE"</span>: <span class="hljs-number">0</span>, <span class="hljs-string">"POSITIVE"</span>: <span class="hljs-number">1</span>} | |
| ) | |
| <span class="hljs-comment"># 2. Pass an instantiated model</span> | |
| model = AutoModelForSequenceClassification.from_pretrained(<span class="hljs-string">"lvwerra/distilbert-imdb"</span>) | |
| eval_results = task_evaluator.compute( | |
| model_or_pipeline=model, | |
| data=data, | |
| label_mapping={<span class="hljs-string">"NEGATIVE"</span>: <span class="hljs-number">0</span>, <span class="hljs-string">"POSITIVE"</span>: <span class="hljs-number">1</span>} | |
| ) | |
| <span class="hljs-comment"># 3. Pass an instantiated pipeline </span> | |
| pipe = pipeline(<span class="hljs-string">"text-classification"</span>, model=<span class="hljs-string">"lvwerra/distilbert-imdb"</span>) | |
| eval_results = task_evaluator.compute( | |
| model_or_pipeline=pipe, | |
| data=data, | |
| label_mapping={<span class="hljs-string">"NEGATIVE"</span>: <span class="hljs-number">0</span>, <span class="hljs-string">"POSITIVE"</span>: <span class="hljs-number">1</span>} | |
| ) | |
| <span class="hljs-built_in">print</span>(eval_results)<!-- HTML_TAG_END --></pre></div> | |
| <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p>Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and the CPU otherwise. If you want to use a specific device you can pass <code>device</code> to <code>compute</code>, where -1 will use the CPU and a non-negative integer (starting with 0) will use the associated CUDA device.</p></div> | |
| <p>The results will look as follows:</p> | |
| <div class="code-block relative"> | |
| <pre><!-- HTML_TAG_START -->{ | |
| <span class="hljs-string">'accuracy'</span>: <span class="hljs-number">0.918</span>, | |
| <span class="hljs-string">'latency_in_seconds'</span>: <span class="hljs-number">0.013</span>, | |
| <span class="hljs-string">'samples_per_second'</span>: <span class="hljs-number">78.887</span>, | |
| <span class="hljs-string">'total_time_in_seconds'</span>: <span class="hljs-number">12.676</span> | |
| }<!-- HTML_TAG_END --></pre></div> | |
| <p>Note that the evaluation results include both the requested metric and information about the time it took to obtain predictions through the pipeline.</p> | |
| <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p>The timing measurements can give a useful indication of a model’s inference speed but should be taken with a grain of salt: they include all the processing that goes on in the pipeline. This may include tokenization and post-processing, which may differ depending on the model. Furthermore, the timings depend a lot on the hardware you are running the evaluation on, and you may be able to improve the performance by optimizing things like the batch size.</p></div> | |
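| <p>To make the relation between the three timing fields concrete, here is a minimal sketch. It assumes all three values are derived from a single wall-clock measurement over the whole dataset; the <code>time_perf</code> helper is hypothetical and not part of the library’s public API.</p> | |

```python
def time_perf(total_time_in_seconds, num_samples):
    """Derive throughput and per-sample latency from one wall-clock measurement."""
    samples_per_second = num_samples / total_time_in_seconds
    latency_in_seconds = total_time_in_seconds / num_samples
    return {
        "total_time_in_seconds": total_time_in_seconds,
        "samples_per_second": samples_per_second,
        "latency_in_seconds": latency_in_seconds,
    }


# 1000 samples processed in 12.676 seconds, as in the example results.
perf = time_perf(12.676, 1000)
print({name: round(value, 3) for name, value in perf.items()})
```

| <p>Because the inputs here are already rounded, the last digit of the throughput can differ slightly from the value reported by an actual run.</p> | |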
| <h3 class="relative group"><a id="evaluate-multiple-metrics" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#evaluate-multiple-metrics"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Evaluate multiple metrics | |
| </span></h3> | |
| <p>With the <a href="/docs/evaluate/v0.2.2/en/package_reference/main_classes#evaluate.combine">combine()</a> function one can bundle several metrics into an object that behaves like a single metric. We can use this to evaluate several metrics at once with the evaluator:</p> | |
| <div class="code-block relative"> | |
| <pre><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> evaluate | |
| eval_results = task_evaluator.compute( | |
| model_or_pipeline=<span class="hljs-string">"lvwerra/distilbert-imdb"</span>, | |
| data=data, | |
| metric=evaluate.combine([<span class="hljs-string">"accuracy"</span>, <span class="hljs-string">"recall"</span>, <span class="hljs-string">"precision"</span>, <span class="hljs-string">"f1"</span>]), | |
| label_mapping={<span class="hljs-string">"NEGATIVE"</span>: <span class="hljs-number">0</span>, <span class="hljs-string">"POSITIVE"</span>: <span class="hljs-number">1</span>} | |
| ) | |
| <span class="hljs-built_in">print</span>(eval_results) | |
| <!-- HTML_TAG_END --></pre></div> | |
| <p>The results will look as follows:</p> | |
| <div class="code-block relative"> | |
| <pre><!-- HTML_TAG_START -->{ | |
| <span class="hljs-string">'accuracy'</span>: <span class="hljs-number">0.918</span>, | |
| <span class="hljs-string">'f1'</span>: <span class="hljs-number">0.916</span>, | |
| <span class="hljs-string">'precision'</span>: <span class="hljs-number">0.9147</span>, | |
| <span class="hljs-string">'recall'</span>: <span class="hljs-number">0.9187</span>, | |
| <span class="hljs-string">'latency_in_seconds'</span>: <span class="hljs-number">0.013</span>, | |
| <span class="hljs-string">'samples_per_second'</span>: <span class="hljs-number">78.887</span>, | |
| <span class="hljs-string">'total_time_in_seconds'</span>: <span class="hljs-number">12.676</span> | |
| }<!-- HTML_TAG_END --></pre></div> | |
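| <p>The idea of bundling several metrics into one object that behaves like a single metric can be sketched in plain Python. The functions below (<code>accuracy</code>, <code>recall</code>, <code>combine_metrics</code>) are hypothetical toy versions written for this illustration; they are not how <code>evaluate.combine</code> is implemented.</p> | |

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the references."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


def recall(predictions, references, positive=1):
    """Fraction of positive references that were predicted as positive."""
    true_pos = sum(p == positive and r == positive for p, r in zip(predictions, references))
    actual_pos = sum(r == positive for r in references)
    return {"recall": true_pos / actual_pos if actual_pos else 0.0}


def combine_metrics(metric_fns):
    """Return one callable that runs every metric and merges the result dicts."""
    def combined(predictions, references):
        results = {}
        for metric_fn in metric_fns:
            results.update(metric_fn(predictions, references))
        return results
    return combined


toy_metric = combine_metrics([accuracy, recall])
scores = toy_metric(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
print(scores)  # {'accuracy': 0.75, 'recall': 1.0}
```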
| <p>Next let’s have a look at token classification.</p> | |
| <h2 class="relative group"><a id="token-classification" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#token-classification"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Token Classification | |
| </span></h2> | |
| <p>With the token classification evaluator one can evaluate models for tasks such as NER or POS tagging. It has the following specific arguments:</p> | |
| <ul><li><code>input_column="text"</code>: with this argument the column with the data for the pipeline can be specified.</li> | |
| <li><code>label_column="label"</code>: with this argument the column with the labels for the evaluation can be specified.</li> | |
| <li><code>label_mapping=None</code>: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. E.g. the labels in <code>label_column</code> can be integers (<code>0</code>/<code>1</code>) whereas the pipeline can produce label names such as <code>"positive"</code>/<code>"negative"</code>. With that dictionary the pipeline outputs are mapped to the labels.</li> | |
| <li><code>join_by=" "</code>: while most datasets are already tokenized, the pipeline expects a string. Thus the tokens need to be joined before they are passed to the pipeline. By default they are joined with a whitespace.</li></ul> | |
| <p>Let’s have a look at how we can use the evaluator to benchmark several models.</p> | |
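| <p>The effect of <code>join_by</code> can be pictured with a short sketch. The token list is a made-up example in the style of a pre-tokenized NER dataset, and the snippet only shows the joining step, not the evaluator’s own code.</p> | |

```python
# Hypothetical pre-tokenized example, in the style of NER datasets.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]

# The default separator; another string could be used for other tokenizations.
join_by = " "
text = join_by.join(tokens)
print(text)  # EU rejects German call to boycott British lamb .
```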
| <h3 class="relative group"><a id="benchmarking-several-models" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#benchmarking-several-models"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Benchmarking several models | |
| </span></h3> | |
| <p>Here is an example where several models are compared with the <code>evaluator</code> in only a few lines of code, abstracting away the preprocessing, inference, postprocessing, and metric computation:</p> | |
| <div class="code-block relative"> | |
| <pre><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd | |
| <span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-keyword">from</span> evaluate <span class="hljs-keyword">import</span> evaluator | |
| <span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline | |
| models = [ | |
| <span class="hljs-string">"xlm-roberta-large-finetuned-conll03-english"</span>, | |
| <span class="hljs-string">"dbmdz/bert-large-cased-finetuned-conll03-english"</span>, | |
| <span class="hljs-string">"elastic/distilbert-base-uncased-finetuned-conll03-english"</span>, | |
| <span class="hljs-string">"dbmdz/electra-large-discriminator-finetuned-conll03-english"</span>, | |
| <span class="hljs-string">"gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner"</span>, | |
| <span class="hljs-string">"philschmid/distilroberta-base-ner-conll2003"</span>, | |
| <span class="hljs-string">"Jorgeutd/albert-base-v2-finetuned-ner"</span>, | |
| ] | |
| data = load_dataset(<span class="hljs-string">"conll2003"</span>, split=<span class="hljs-string">"validation"</span>).shuffle().select(<span class="hljs-built_in">range</span>(<span class="hljs-number">1000</span>)) | |
| task_evaluator = evaluator(<span class="hljs-string">"token-classification"</span>) | |
| results = [] | |
| <span class="hljs-keyword">for</span> model <span class="hljs-keyword">in</span> models: | |
| results.append( | |
| task_evaluator.compute( | |
| model_or_pipeline=model, data=data, metric=<span class="hljs-string">"seqeval"</span> | |
| ) | |
| ) | |
| df = pd.DataFrame(results, index=models) | |
| df[[<span class="hljs-string">"overall_f1"</span>, <span class="hljs-string">"overall_accuracy"</span>, <span class="hljs-string">"total_time_in_seconds"</span>, <span class="hljs-string">"samples_per_second"</span>, <span class="hljs-string">"latency_in_seconds"</span>]]<!-- HTML_TAG_END --></pre></div> | |
| <p>The result is a table that looks like this:</p> | |
| <table><thead><tr><th align="left">model</th> | |
| <th align="right">overall_f1</th> | |
| <th align="right">overall_accuracy</th> | |
| <th align="right">total_time_in_seconds</th> | |
| <th align="right">samples_per_second</th> | |
| <th align="right">latency_in_seconds</th></tr></thead> | |
| <tbody><tr><td align="left">Jorgeutd/albert-base-v2-finetuned-ner</td> | |
| <td align="right">0.941</td> | |
| <td align="right">0.989</td> | |
| <td align="right">4.515</td> | |
| <td align="right">221.468</td> | |
| <td align="right">0.005</td></tr> | |
| <tr><td align="left">dbmdz/bert-large-cased-finetuned-conll03-english</td> | |
| <td align="right">0.962</td> | |
| <td align="right">0.881</td> | |
| <td align="right">11.648</td> | |
| <td align="right">85.850</td> | |
| <td align="right">0.012</td></tr> | |
| <tr><td align="left">dbmdz/electra-large-discriminator-finetuned-conll03-english</td> | |
| <td align="right">0.965</td> | |
| <td align="right">0.881</td> | |
| <td align="right">11.456</td> | |
| <td align="right">87.292</td> | |
| <td align="right">0.011</td></tr> | |
| <tr><td align="left">elastic/distilbert-base-uncased-finetuned-conll03-english</td> | |
| <td align="right">0.940</td> | |
| <td align="right">0.989</td> | |
| <td align="right">2.318</td> | |
| <td align="right">431.378</td> | |
| <td align="right">0.002</td></tr> | |
| <tr><td align="left">gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner</td> | |
| <td align="right">0.947</td> | |
| <td align="right">0.991</td> | |
| <td align="right">2.376</td> | |
| <td align="right">420.873</td> | |
| <td align="right">0.002</td></tr> | |
| <tr><td align="left">philschmid/distilroberta-base-ner-conll2003</td> | |
| <td align="right">0.961</td> | |
| <td align="right">0.994</td> | |
| <td align="right">2.436</td> | |
| <td align="right">410.579</td> | |
| <td align="right">0.002</td></tr> | |
| <tr><td align="left">xlm-roberta-large-finetuned-conll03-english</td> | |
| <td align="right">0.969</td> | |
| <td align="right">0.882</td> | |
| <td align="right">11.996</td> | |
| <td align="right">83.359</td> | |
| <td align="right">0.012</td></tr></tbody></table> | |
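| <p>A table like this is usually read along two axes, quality and speed. As a small sketch, the snippet below picks the best model on each axis using the <code>overall_f1</code> and <code>samples_per_second</code> values copied from the table; the <code>rows</code> list is simply a hand-written stand-in for the DataFrame.</p> | |

```python
# (model, overall_f1, samples_per_second) copied from the table above.
rows = [
    ("Jorgeutd/albert-base-v2-finetuned-ner", 0.941, 221.468),
    ("dbmdz/bert-large-cased-finetuned-conll03-english", 0.962, 85.850),
    ("dbmdz/electra-large-discriminator-finetuned-conll03-english", 0.965, 87.292),
    ("elastic/distilbert-base-uncased-finetuned-conll03-english", 0.940, 431.378),
    ("gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner", 0.947, 420.873),
    ("philschmid/distilroberta-base-ner-conll2003", 0.961, 410.579),
    ("xlm-roberta-large-finetuned-conll03-english", 0.969, 83.359),
]

# Highest F1 and highest throughput are generally different models.
best_f1 = max(rows, key=lambda row: row[1])
fastest = max(rows, key=lambda row: row[2])
print(best_f1[0])  # xlm-roberta-large-finetuned-conll03-english
print(fastest[0])  # elastic/distilbert-base-uncased-finetuned-conll03-english
```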
| <h2 class="relative group"><a id="question-answering" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#question-answering"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Question Answering | |
| </span></h2> | |
| <p>With the question-answering evaluator one can evaluate models for QA without needing to worry about the complicated pre- and post-processing that’s required for these models. It has the following specific arguments:</p> | |
| <ul><li><code>question_column="question"</code>: the name of the column containing the question in the dataset </li> | |
| <li><code>context_column="context"</code>: the name of the column containing the context</li> | |
| <li><code>id_column="id"</code>: the name of the column containing the identification field of the question and answer pair</li> | |
| <li><code>label_column="answers"</code>: the name of the column containing the answers</li> | |
| <li><code>squad_v2_format=None</code>: whether the dataset follows the format of squad_v2 dataset where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred.</li></ul> | |
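| <p>For orientation, the sketch below shows what SQuAD-style records with these default column names can look like, and one plausible way to tell whether a dataset contains unanswerable questions. The records are made up, and <code>has_unanswerable_questions</code> is a hypothetical helper; the evaluator’s own format inference may work differently.</p> | |

```python
# Illustrative SQuAD-style records; the values are made up for this sketch.
answerable = {
    "id": "example-1",
    "question": "Where is the Eiffel Tower?",
    "context": "The Eiffel Tower is in Paris.",
    # `answer_start` is the character offset of the answer inside `context`.
    "answers": {"text": ["Paris"], "answer_start": [23]},
}

# In the squad_v2 style, a question may have no answer in the context.
unanswerable = {
    "id": "example-2",
    "question": "Who designed the Louvre?",
    "context": "The Eiffel Tower is in Paris.",
    "answers": {"text": [], "answer_start": []},
}


def has_unanswerable_questions(examples, label_column="answers"):
    """Hypothetical check: True if any example has an empty answer list."""
    return any(len(example[label_column]["text"]) == 0 for example in examples)


print(has_unanswerable_questions([answerable]))                # False
print(has_unanswerable_questions([answerable, unanswerable]))  # True
```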
| <p>Let’s have a look at how we can evaluate QA models and compute confidence intervals at the same time.</p> | |
| <h3 class="relative group"><a id="confidence-intervals" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#confidence-intervals"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Confidence intervals | |
| </span></h3> | |
| <p>Every evaluator comes with the option to compute confidence intervals using <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html" rel="nofollow">bootstrapping</a>. Simply pass <code>strategy="bootstrap"</code> and set the number of resamples with <code>n_resamples</code>.</p> | |
| <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> | |
| <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> | |
| Copied</div></button></div> | |
| <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-keyword">from</span> evaluate <span class="hljs-keyword">import</span> evaluator | |
| task_evaluator = evaluator(<span class="hljs-string">"question-answering"</span>) | |
| data = load_dataset(<span class="hljs-string">"squad"</span>, split=<span class="hljs-string">"validation[:1000]"</span>) | |
| eval_results = task_evaluator.compute( | |
| model_or_pipeline=<span class="hljs-string">"distilbert-base-uncased-distilled-squad"</span>, | |
| data=data, | |
| metric=<span class="hljs-string">"squad"</span>, | |
| strategy=<span class="hljs-string">"bootstrap"</span>, | |
| n_resamples=<span class="hljs-number">30</span> | |
| )<!-- HTML_TAG_END --></pre></div> | |
| <p>Results include confidence intervals as well as error estimates as follows:</p> | |
| <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> | |
| <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> | |
| Copied</div></button></div> | |
| <pre><!-- HTML_TAG_START -->{ | |
| <span class="hljs-string">'exact_match'</span>: | |
| { | |
| <span class="hljs-string">'confidence_interval'</span>: (<span class="hljs-number">79.67</span>, <span class="hljs-number">84.54</span>), | |
| <span class="hljs-string">'score'</span>: <span class="hljs-number">82.30</span>, | |
| <span class="hljs-string">'standard_error'</span>: <span class="hljs-number">1.28</span> | |
| }, | |
| <span class="hljs-string">'f1'</span>: | |
| { | |
| <span class="hljs-string">'confidence_interval'</span>: (<span class="hljs-number">85.30</span>, <span class="hljs-number">88.88</span>), | |
| <span class="hljs-string">'score'</span>: <span class="hljs-number">87.23</span>, | |
| <span class="hljs-string">'standard_error'</span>: <span class="hljs-number">0.97</span> | |
| }, | |
| <span class="hljs-string">'latency_in_seconds'</span>: <span class="hljs-number">0.0085</span>, | |
| <span class="hljs-string">'samples_per_second'</span>: <span class="hljs-number">117.31</span>, | |
| <span class="hljs-string">'total_time_in_seconds'</span>: <span class="hljs-number">8.52</span> | |
| }<!-- HTML_TAG_END --></pre></div> | |
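| <p>For intuition about where the interval and the standard error come from, here is a minimal standard-library sketch of bootstrapping a mean score. It uses the simple percentile method on synthetic per-example scores, which is an assumption made for this sketch and not necessarily what the evaluator or <code>scipy.stats.bootstrap</code> does internally; the function name <code>bootstrap_ci</code> and the data are hypothetical.</p> | |

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=30, confidence_level=0.95, seed=0):
    """Percentile bootstrap of the mean (illustrative, not the library's code)."""
    rng = random.Random(seed)
    n = len(scores)
    resampled_means = []
    for _ in range(n_resamples):
        # Draw n examples with replacement and recompute the metric.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(statistics.fmean(sample))
    resampled_means.sort()

    # Pick the resampled means at the lower and upper percentiles.
    alpha = (1.0 - confidence_level) / 2.0
    low_index = int(alpha * (n_resamples - 1))
    high_index = int(round((1.0 - alpha) * (n_resamples - 1)))

    return {
        "score": statistics.fmean(scores),
        "confidence_interval": (
            resampled_means[low_index],
            resampled_means[high_index],
        ),
        # Spread of the resampled metric values.
        "standard_error": statistics.stdev(resampled_means),
    }


# Synthetic per-example exact-match scores: 823 hits and 177 misses.
per_example_scores = [100.0] * 823 + [0.0] * 177
result = bootstrap_ci(per_example_scores, n_resamples=30)
```

| <p>With only 30 resamples the interval is itself noisy; a larger <code>n_resamples</code> gives a more stable estimate at the cost of more computation.</p> | |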
| <h2 class="relative group"><a id="image-classification" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#image-classification"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Image classification | |
| </span></h2> | |
| <p>With the image classification evaluator we can evaluate any image classifier. It uses the same keyword arguments as the text classifier:</p> | |
| <ul><li><code>input_column="image"</code>: the name of the column containing the images as PIL ImageFile</li> | |
| <li><code>label_column="label"</code>: the name of the column containing the labels</li> | |
| <li><code>label_mapping=None</code>: maps the class labels defined by the model in the pipeline to values consistent with those defined in the <code>label_column</code></li></ul> | |
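| <p>The role of <code>label_mapping</code> can be sketched in plain Python: the pipeline predicts label names, the dataset stores integer class ids, and the mapping bridges the two before the metric is computed. The predictions, references, and mapping below are hypothetical values chosen for illustration, and the accuracy is computed by hand rather than through the library.</p> | |

```python
# Hypothetical pipeline outputs: one dict per image with the predicted label name.
predictions = [
    {"label": "tabby", "score": 0.91},
    {"label": "golden retriever", "score": 0.77},
    {"label": "tabby", "score": 0.64},
]

# Hypothetical integer class ids as they would appear in the label column.
references = [0, 1, 1]

# Maps the model's label names to the ids used by the dataset.
label_mapping = {"tabby": 0, "golden retriever": 1}

# Convert label names to ids so they can be compared with the references.
predicted_ids = [label_mapping[p["label"]] for p in predictions]

# Fraction of predictions that match the reference id.
accuracy = sum(p == r for p, r in zip(predicted_ids, references)) / len(references)
```

| <p>Without such a mapping, label names and integer ids cannot be compared directly.</p> | |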
| <p>Let’s have a look at how we can evaluate image classification models on large datasets.</p> | |
| <h3 class="relative group"><a id="handling-large-datasets" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#handling-large-datasets"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> | |
| <span>Handling large datasets | |
| </span></h3> | |
| <p>The evaluator can be used on large datasets! The example below shows how to use it on ImageNet-1k for image classification. Beware that this example requires downloading ~150 GB.</p> | |
| <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> | |
| <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> | |
| Copied</div></button></div> | |
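| <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-keyword">from</span> evaluate <span class="hljs-keyword">import</span> evaluator | |
| <span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline | |
| data = load_dataset(<span class="hljs-string">"imagenet-1k"</span>, split=<span class="hljs-string">"validation"</span>, use_auth_token=<span class="hljs-literal">True</span>) | |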
| pipe = pipeline( | |
| task=<span class="hljs-string">"image-classification"</span>, | |
| model=<span class="hljs-string">"facebook/deit-small-distilled-patch16-224"</span> | |
| ) | |
| task_evaluator = evaluator(<span class="hljs-string">"image-classification"</span>) | |
| eval_results = task_evaluator.compute( | |
| model_or_pipeline=pipe, | |
| data=data, | |
| metric=<span class="hljs-string">"accuracy"</span>, | |
| label_mapping=pipe.model.config.label2id | |
| )<!-- HTML_TAG_END --></pre></div> | |
| <p>Since we are using <code>datasets</code> to store the data, we make use of a technique called memory mapping. This means that the dataset is never fully loaded into memory, which saves a lot of RAM. Running the above code uses only roughly 1.5 GB of RAM, while the validation split is more than 30 GB in size.</p> | |
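| <p>The idea behind memory mapping can be illustrated with Python’s built-in <code>mmap</code> module. This is an analogy only: it maps a small temporary file of fixed-size records and reads a single record without reading the whole file into a Python object. The record layout and the helper <code>read_record</code> are assumptions made for this sketch and say nothing about the on-disk format that <code>datasets</code> uses.</p> | |

```python
import mmap
import os
import tempfile

record_size = 1024   # bytes per record (arbitrary choice for the sketch)
num_records = 4096   # 4 MiB in total

# Write a file of fixed-size records; record i is filled with the byte i % 256.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for i in range(num_records):
        f.write(bytes([i % 256]) * record_size)


def read_record(path, index, record_size):
    """Return one record by mapping the file instead of reading it entirely."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start = index * record_size
            # Slicing copies only the requested bytes out of the mapping.
            return mapped[start:start + record_size]


record = read_record(path, 3000, record_size)
os.remove(path)  # clean up the temporary file
```

| <p>The operating system pages in only the parts of the file that are touched, which is why memory use can stay far below the size of the data on disk.</p> | |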
| <script type="module" data-hydrate="lvrkpk"> | |
| import { start } from "/docs/evaluate/v0.2.2/en/_app/start-hf-doc-builder.js"; | |
| start({ | |
| target: document.querySelector('[data-hydrate="lvrkpk"]').parentNode, | |
| paths: {"base":"/docs/evaluate/v0.2.2/en","assets":"/docs/evaluate/v0.2.2/en"}, | |
| session: {}, | |
| route: false, | |
| spa: false, | |
| trailing_slash: "never", | |
| hydrate: { | |
| status: 200, | |
| error: null, | |
| nodes: [ | |
| import("/docs/evaluate/v0.2.2/en/_app/pages/__layout.svelte-hf-doc-builder.js"), | |
| import("/docs/evaluate/v0.2.2/en/_app/pages/base_evaluator.mdx-hf-doc-builder.js") | |
| ], | |
| params: {} | |
| } | |
| }); | |
| </script> | |