| <meta charset="utf-8" /><meta name="hf:doc:metadata" content='{"title":"Create a dataset loading script","local":"create-a-dataset-loading-script","sections":[{"title":"Add dataset attributes","local":"add-dataset-attributes","sections":[{"title":"Multiple configurations","local":"multiple-configurations","sections":[],"depth":3},{"title":"Default configurations","local":"default-configurations","sections":[],"depth":3}],"depth":2},{"title":"Download data files and organize splits","local":"download-data-files-and-organize-splits","sections":[],"depth":2},{"title":"Generate samples","local":"generate-samples","sections":[],"depth":2},{"title":"(Optional) Generate dataset metadata","local":"optional-generate-dataset-metadata","sections":[],"depth":2},{"title":"Upload to the Hub","local":"upload-to-the-hub","sections":[],"depth":2},{"title":"Advanced features","local":"advanced-features","sections":[{"title":"Sharding","local":"sharding","sections":[],"depth":3},{"title":"ArrowBasedBuilder","local":"arrowbasedbuilder","sections":[],"depth":3}],"depth":2}],"depth":1}'> | |
| <h1 class="relative group"><a id="create-a-dataset-loading-script" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#create-a-dataset-loading-script"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Create a dataset loading script</span></h1> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1uuckrj">The dataset loading script is likely not needed if your dataset is in one of the following formats: CSV, JSON, JSON lines, text, images, audio or Parquet. | |
| With those formats, you should be able to load your dataset automatically with <a href="/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset">load_dataset()</a>, | |
| as long as your dataset repository follows the <a href="./repository_structure">required structure</a>.</p></div> <div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400"><p data-svelte-h="svelte-1adalis">For security reasons, 🤗 Datasets does not allow running dataset loading scripts by default; you have to pass <code>trust_remote_code=True</code> to load datasets that require running a dataset script.</p></div> <p data-svelte-h="svelte-1k7fcoj">Write a dataset script to load and share datasets that consist of data files in unsupported formats or that require more complex data preparation. | |
| This is a more advanced way to define a dataset than using <a href="./repository_structure#define-your-splits-in-yaml">YAML metadata in the dataset card</a>. | |
| A dataset script is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.</p> <p data-svelte-h="svelte-gltuje">The script can download data files from any website, or from the same dataset repository.</p> <p data-svelte-h="svelte-1qit7w">A dataset loading script should have the same name as its dataset repository or directory. For example, a repository named <code>my_dataset</code> should contain a <code>my_dataset.py</code> script. This way it can be loaded with:</p> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START -->my_dataset/ | |
| ├── README.md | |
| └── my_dataset.py<!-- HTML_TAG_END --></pre></div> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">>>> </span><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-meta">>>> </span>load_dataset(<span class="hljs-string">"path/to/my_dataset"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-mekosh">This guide shows how to write a dataset script that can:</p> <ul data-svelte-h="svelte-v28gob"><li>Add dataset attributes.</li> <li>Download data files.</li> <li>Generate samples.</li> <li>Generate dataset metadata.</li> <li>Upload a dataset to the Hub.</li></ul> <p data-svelte-h="svelte-3izox6">Open the <a href="https://huggingface.co/datasets/squad/blob/main/squad.py" rel="nofollow">SQuAD dataset loading script</a> to follow along with this guide.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1ews9ox">To help you get started, try beginning with the dataset loading script <a href="https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py" rel="nofollow">template</a>!</p></div> <h2 class="relative group"><a id="add-dataset-attributes" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#add-dataset-attributes"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Add dataset attributes</span></h2> <p data-svelte-h="svelte-1yjv3de">The first step is to add some information, or attributes, about your dataset in <code>DatasetBuilder._info()</code>. The most important attributes you should specify are:</p> <ol data-svelte-h="svelte-1jlvuo9"><li><p><code>DatasetInfo.description</code> provides a concise description of your dataset. The description informs the user what’s in the dataset, how it was collected, and how it can be used for an NLP task.</p></li> <li><p><code>DatasetInfo.features</code> defines the name and type of each column in your dataset. This also provides the structure for each example, so you can create nested subfields in a column if you want. Take a look at <a href="/docs/datasets/main/en/package_reference/main_classes#datasets.Features">Features</a> for a full list of feature types you can use.</p></li></ol> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START -->datasets.Features( | |
|     { | |
|         <span class="hljs-string">"id"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|         <span class="hljs-string">"title"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|         <span class="hljs-string">"context"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|         <span class="hljs-string">"question"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|         <span class="hljs-string">"answers"</span>: datasets.<span class="hljs-type">Sequence</span>( | |
|             { | |
|                 <span class="hljs-string">"text"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|                 <span class="hljs-string">"answer_start"</span>: datasets.Value(<span class="hljs-string">"int32"</span>), | |
|             } | |
|         ), | |
|     } | |
| )<!-- HTML_TAG_END --></pre></div> <ol start="3" data-svelte-h="svelte-1gzz03n"><li><p><code>DatasetInfo.homepage</code> contains the URL to the dataset homepage so users can find more details about the dataset.</p></li> <li><p><code>DatasetInfo.citation</code> contains a BibTeX citation for the dataset.</p></li></ol> <p data-svelte-h="svelte-j8hcd0">After you’ve filled out all these fields in the template, it should look like the following example from the SQuAD loading script:</p> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">_info</span>(<span class="hljs-params">self</span>): | |
|     <span class="hljs-keyword">return</span> datasets.DatasetInfo( | |
|         description=_DESCRIPTION, | |
|         features=datasets.Features( | |
|             { | |
|                 <span class="hljs-string">"id"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|                 <span class="hljs-string">"title"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|                 <span class="hljs-string">"context"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|                 <span class="hljs-string">"question"</span>: datasets.Value(<span class="hljs-string">"string"</span>), | |
|                 <span class="hljs-string">"answers"</span>: datasets.features.<span class="hljs-type">Sequence</span>( | |
|                     {<span class="hljs-string">"text"</span>: datasets.Value(<span class="hljs-string">"string"</span>), <span class="hljs-string">"answer_start"</span>: datasets.Value(<span class="hljs-string">"int32"</span>),} | |
|                 ), | |
|             } | |
|         ), | |
|         <span class="hljs-comment"># No default supervised_keys (as we have to pass both question</span> | |
|         <span class="hljs-comment"># and context as input).</span> | |
|         supervised_keys=<span class="hljs-literal">None</span>, | |
|         homepage=<span class="hljs-string">"https://rajpurkar.github.io/SQuAD-explorer/"</span>, | |
|         citation=_CITATION, | |
|     )<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="multiple-configurations" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#multiple-configurations"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Multiple configurations</span></h3> <p data-svelte-h="svelte-141nqtg">In some cases, your dataset may have multiple configurations. For example, the <a href="https://huggingface.co/datasets/super_glue" rel="nofollow">SuperGLUE</a> dataset is a collection of 5 datasets designed to evaluate language understanding tasks. 🤗 Datasets provides <a href="/docs/datasets/main/en/package_reference/builder_classes#datasets.BuilderConfig">BuilderConfig</a>, which lets you create different configurations for the user to select from.</p> <p data-svelte-h="svelte-1wah8i0">Let’s study the <a href="https://huggingface.co/datasets/super_glue/blob/main/super_glue.py" rel="nofollow">SuperGLUE loading script</a> to see how you can define several configurations.</p> <ol data-svelte-h="svelte-a86a0j"><li>Create a <a href="/docs/datasets/main/en/package_reference/builder_classes#datasets.BuilderConfig">BuilderConfig</a> subclass with attributes about your dataset. These attributes can be the features of your dataset, label classes, and a URL to the data files.</li></ol> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">class</span> <span class="hljs-title class_">SuperGlueConfig</span>(datasets.BuilderConfig): | |
|     <span class="hljs-string">"""BuilderConfig for SuperGLUE."""</span> | |
|     <span class="hljs-keyword">def</span> <span class="hljs-title function_">__init__</span>(<span class="hljs-params">self, features, data_url, citation, url, label_classes=(<span class="hljs-params"><span class="hljs-string">"False"</span>, <span class="hljs-string">"True"</span></span>), **kwargs</span>): | |
|         <span class="hljs-string">"""BuilderConfig for SuperGLUE. | |
|         Args: | |
|           features: *list[string]*, list of the features that will appear in the | |
|             feature dict. Should not include "label". | |
|           data_url: *string*, url to download the zip file from. | |
|           citation: *string*, citation for the data set. | |
|           url: *string*, url for information about the data set. | |
|           label_classes: *list[string]*, the list of classes for the label if the | |
|             label is present as a string. Non-string labels will be cast to either | |
|             'False' or 'True'. | |
|           **kwargs: keyword arguments forwarded to super. | |
|         """</span> | |
|         <span class="hljs-comment"># Version history:</span> | |
|         <span class="hljs-comment"># 1.0.2: Fixed nondeterminism in ReCoRD.</span> | |
|         <span class="hljs-comment"># 1.0.1: Change from the pre-release trial version of SuperGLUE (v1.9) to</span> | |
|         <span class="hljs-comment"># the full release (v2.0).</span> | |
|         <span class="hljs-comment"># 1.0.0: S3 (new shuffling, sharding and slicing mechanism).</span> | |
|         <span class="hljs-comment"># 0.0.2: Initial version.</span> | |
|         <span class="hljs-built_in">super</span>().__init__(version=datasets.Version(<span class="hljs-string">"1.0.2"</span>), **kwargs) | |
|         self.features = features | |
|         self.label_classes = label_classes | |
|         self.data_url = data_url | |
|         self.citation = citation | |
|         self.url = url<!-- HTML_TAG_END --></pre></div> <ol start="2" data-svelte-h="svelte-1hlknx7"><li>Create instances of your config to specify the attribute values of each configuration. This lets you set the name and description of each configuration. List these instances under <code>DatasetBuilder.BUILDER_CONFIGS</code>:</li></ol> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">class</span> <span class="hljs-title class_">SuperGlue</span>(datasets.GeneratorBasedBuilder): | |
|     <span class="hljs-string">"""The SuperGLUE benchmark."""</span> | |
|     BUILDER_CONFIG_CLASS = SuperGlueConfig | |
|     BUILDER_CONFIGS = [ | |
|         SuperGlueConfig( | |
|             name=<span class="hljs-string">"boolq"</span>, | |
|             description=_BOOLQ_DESCRIPTION, | |
|             features=[<span class="hljs-string">"question"</span>, <span class="hljs-string">"passage"</span>], | |
|             data_url=<span class="hljs-string">"https://dl.fbaipublicfiles.com/glue/superglue/data/v2/BoolQ.zip"</span>, | |
|             citation=_BOOLQ_CITATION, | |
|             url=<span class="hljs-string">"https://github.com/google-research-datasets/boolean-questions"</span>, | |
|         ), | |
|         ... | |
|         ... | |
|         SuperGlueConfig( | |
|             name=<span class="hljs-string">"axg"</span>, | |
|             description=_AXG_DESCRIPTION, | |
|             features=[<span class="hljs-string">"premise"</span>, <span class="hljs-string">"hypothesis"</span>], | |
|             label_classes=[<span class="hljs-string">"entailment"</span>, <span class="hljs-string">"not_entailment"</span>], | |
|             data_url=<span class="hljs-string">"https://dl.fbaipublicfiles.com/glue/superglue/data/v2/AX-g.zip"</span>, | |
|             citation=_AXG_CITATION, | |
|             url=<span class="hljs-string">"https://github.com/rudinger/winogender-schemas"</span>, | |
|         ), | |
|     ]<!-- HTML_TAG_END --></pre></div> <ol start="3" data-svelte-h="svelte-odp277"><li>Now, users can load a specific configuration of the dataset with the configuration <code>name</code>:</li></ol> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">>>> </span><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-meta">>>> </span>dataset = load_dataset(<span class="hljs-string">'super_glue'</span>, <span class="hljs-string">'boolq'</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1smmo5p">Additionally, users can instantiate a custom builder configuration by passing the builder configuration arguments to <a href="/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset">load_dataset()</a>:</p> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">>>> </span><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-meta">>>> </span>dataset = load_dataset(<span class="hljs-string">'super_glue'</span>, data_url=<span class="hljs-string">"https://custom_url"</span>)<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="default-configurations" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#default-configurations"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Default configurations</span></h3> <p data-svelte-h="svelte-19zng3q">Users must specify a configuration name when they load a dataset with multiple configurations. Otherwise, 🤗 Datasets will raise a <code>ValueError</code>, and prompt the user to select a configuration name. You can avoid this by setting a default dataset configuration with the <code>DEFAULT_CONFIG_NAME</code> attribute:</p> <div class="code-block relative"><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">class</span> <span class="hljs-title class_">NewDataset</span>(datasets.GeneratorBasedBuilder): | |
| VERSION = datasets.Version(<span class="hljs-string">"1.1.0"</span>) | |
| BUILDER_CONFIGS = [ | |
| datasets.BuilderConfig(name=<span class="hljs-string">"first_domain"</span>, version=VERSION, description=<span class="hljs-string">"This part of my dataset covers a first domain"</span>), | |
| datasets.BuilderConfig(name=<span class="hljs-string">"second_domain"</span>, version=VERSION, description=<span class="hljs-string">"This part of my dataset covers a second domain"</span>), | |
| ] | |
| DEFAULT_CONFIG_NAME = <span class="hljs-string">"first_domain"</span><!-- HTML_TAG_END --></pre></div> <div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400"><p data-svelte-h="svelte-1h3crem">Only use a default configuration when it makes sense. Don’t set one because it may be more convenient for the user to not specify a configuration when they load your dataset. For example, multi-lingual datasets often have a separate configuration for each language. An appropriate default may be an aggregated configuration that loads all the languages of the dataset if the user doesn’t request a particular one.</p></div> <h2 class="relative group"><a id="download-data-files-and-organize-splits" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#download-data-files-and-organize-splits"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Download data files and organize splits</span></h2> <p data-svelte-h="svelte-jmz9cn">After you’ve defined the 
attributes of your dataset, the next step is to download the data files and organize them according to their splits.</p> <ol data-svelte-h="svelte-1o1li8d"><li>Create a dictionary of URLs in the loading script that point to the original SQuAD data files:</li></ol> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->_URL = <span class="hljs-string">"https://rajpurkar.github.io/SQuAD-explorer/dataset/"</span> | |
| _URLS = { | |
| <span class="hljs-string">"train"</span>: _URL + <span class="hljs-string">"train-v1.1.json"</span>, | |
| <span class="hljs-string">"dev"</span>: _URL + <span class="hljs-string">"dev-v1.1.json"</span>, | |
| }<!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1xoorbu">If the data files live in the same folder or repository as the dataset script, you can pass relative paths to the files instead of URLs.</p></div> <ol start="2" data-svelte-h="svelte-ht8hl9"><li><p><a href="/docs/datasets/main/en/package_reference/builder_classes#datasets.DownloadManager.download_and_extract">DownloadManager.download_and_extract()</a> takes this dictionary and downloads the data files. Once the files are downloaded, use <a href="/docs/datasets/main/en/package_reference/builder_classes#datasets.SplitGenerator">SplitGenerator</a> to organize each split in the dataset. This is a simple class that contains:</p> <ul><li><p>The <code>name</code> of each split. You should use the standard split names: <code>Split.TRAIN</code>, <code>Split.TEST</code>, and <code>Split.VALIDATION</code>.</p></li> <li><p><code>gen_kwargs</code>, which provides the file paths to the data files to load for each split.</p></li></ul></li></ol> <p data-svelte-h="svelte-leta0s">Your <code>DatasetBuilder._split_generators()</code> should look like this now:</p> <div class="code-block relative"> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">_split_generators</span>(<span class="hljs-params">self, dl_manager: datasets.DownloadManager</span>) -> <span class="hljs-type">List</span>[datasets.SplitGenerator]: | |
| urls_to_download = _URLS | |
| downloaded_files = dl_manager.download_and_extract(urls_to_download) | |
| <span class="hljs-keyword">return</span> [ | |
| datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={<span class="hljs-string">"filepath"</span>: downloaded_files[<span class="hljs-string">"train"</span>]}), | |
| datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={<span class="hljs-string">"filepath"</span>: downloaded_files[<span class="hljs-string">"dev"</span>]}), | |
| ]<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="generate-samples" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#generate-samples"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Generate samples</span></h2> <p data-svelte-h="svelte-1edtbpg">At this point, you have:</p> <ul data-svelte-h="svelte-1dze5ec"><li>Added the dataset attributes.</li> <li>Provided instructions for how to download the data files.</li> <li>Organized the splits.</li></ul> <p data-svelte-h="svelte-atwoq9">The next step is to actually generate the samples in each split.</p> <ol data-svelte-h="svelte-1euwcht"><li><p><code>DatasetBuilder._generate_examples</code> takes the file path provided by <code>gen_kwargs</code> to read and parse the data files. 
You need to write a function that loads the data files and extracts the columns.</p></li> <li><p>Your function should yield a tuple of an <code>id_</code>, and an example from the dataset.</p></li></ol> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">_generate_examples</span>(<span class="hljs-params">self, filepath</span>): | |
| <span class="hljs-string">"""This function returns the examples in the raw (text) form."""</span> | |
| logger.info(<span class="hljs-string">"generating examples from = %s"</span>, filepath) | |
| <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(filepath, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f: | |
| squad = json.load(f) | |
| <span class="hljs-keyword">for</span> article <span class="hljs-keyword">in</span> squad[<span class="hljs-string">"data"</span>]: | |
| title = article.get(<span class="hljs-string">"title"</span>, <span class="hljs-string">""</span>).strip() | |
| <span class="hljs-keyword">for</span> paragraph <span class="hljs-keyword">in</span> article[<span class="hljs-string">"paragraphs"</span>]: | |
| context = paragraph[<span class="hljs-string">"context"</span>].strip() | |
| <span class="hljs-keyword">for</span> qa <span class="hljs-keyword">in</span> paragraph[<span class="hljs-string">"qas"</span>]: | |
| question = qa[<span class="hljs-string">"question"</span>].strip() | |
| id_ = qa[<span class="hljs-string">"id"</span>] | |
| answer_starts = [answer[<span class="hljs-string">"answer_start"</span>] <span class="hljs-keyword">for</span> answer <span class="hljs-keyword">in</span> qa[<span class="hljs-string">"answers"</span>]] | |
| answers = [answer[<span class="hljs-string">"text"</span>].strip() <span class="hljs-keyword">for</span> answer <span class="hljs-keyword">in</span> qa[<span class="hljs-string">"answers"</span>]] | |
| <span class="hljs-comment"># Features currently used are "context", "question", and "answers".</span> | |
| <span class="hljs-comment"># Others are extracted here for the ease of future expansions.</span> | |
| <span class="hljs-keyword">yield</span> id_, { | |
| <span class="hljs-string">"title"</span>: title, | |
| <span class="hljs-string">"context"</span>: context, | |
| <span class="hljs-string">"question"</span>: question, | |
| <span class="hljs-string">"id"</span>: id_, | |
| <span class="hljs-string">"answers"</span>: {<span class="hljs-string">"answer_start"</span>: answer_starts, <span class="hljs-string">"text"</span>: answers,}, | |
| }<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="optional-generate-dataset-metadata" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#optional-generate-dataset-metadata"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>(Optional) Generate dataset metadata</span></h2> <p data-svelte-h="svelte-pxkn08">Adding dataset metadata is a great way to include information about your dataset. The metadata is stored in the dataset card <code>README.md</code> in YAML. 
It includes information like the number of examples required to confirm the dataset was correctly generated, and information about the dataset like its <code>features</code>.</p> <p data-svelte-h="svelte-10rffde">Run the following command to generate your dataset metadata in <code>README.md</code> and make sure your new dataset loading script works correctly:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->datasets-cli test path/<span class="hljs-keyword">to</span>/<your-dataset-loading-<span class="hljs-keyword">script</span>> <span class="hljs-comment">--save_info --all_configs</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1p203tz">If your dataset loading script passed the test, you should now have a <code>README.md</code> file in your dataset folder containing a <code>dataset_info</code> field with some 
metadata.</p> <h2 class="relative group"><a id="upload-to-the-hub" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#upload-to-the-hub"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Upload to the Hub</span></h2> <p data-svelte-h="svelte-1oe26y1">Once your script is ready, <a href="dataset_card">create a dataset card</a> and <a href="share">upload it to the Hub</a>.</p> <p data-svelte-h="svelte-1539djf">Congratulations, you can now load your dataset from the Hub! 
🥳</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">>>> </span><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-meta">>>> </span>load_dataset(<span class="hljs-string">"<username>/my_dataset"</span>)<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="advanced-features" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#advanced-features"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Advanced features</span></h2> <h3 class="relative group"><a id="sharding" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#sharding"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 
28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Sharding</span></h3> <p data-svelte-h="svelte-1sxj7ge">If your dataset is made of many large files, 🤗 Datasets can run your script in parallel to speed up dataset generation. | |
| It can help if you have hundreds or thousands of TAR archives, or JSONL files like <a href="https://huggingface.co/datasets/oscar/blob/main/oscar.py" rel="nofollow">oscar</a> for example.</p> <p data-svelte-h="svelte-1abrkus">To make it work, we consider lists of files in <code>gen_kwargs</code> to be shards. | |
| Therefore 🤗 Datasets can automatically spawn several workers to run <code>_generate_examples</code> in parallel, and each worker is given a subset of shards to process.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --> | |
| <span class="hljs-keyword">class</span> <span class="hljs-title class_">MyShardedDataset</span>(datasets.GeneratorBasedBuilder): | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">_split_generators</span>(<span class="hljs-params">self, dl_manager: datasets.DownloadManager</span>) -> <span class="hljs-type">List</span>[datasets.SplitGenerator]: | |
| downloaded_files = dl_manager.download([<span class="hljs-string">f"data/shard_<span class="hljs-subst">{i}</span>.jsonl"</span> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">1024</span>)]) | |
| <span class="hljs-keyword">return</span> [ | |
| datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={<span class="hljs-string">"filepaths"</span>: downloaded_files}), | |
| ] | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">_generate_examples</span>(<span class="hljs-params">self, filepaths</span>): | |
| <span class="hljs-comment"># Each worker can be given a slice of the original `filepaths` list defined in the `gen_kwargs`</span> | |
| <span class="hljs-comment"># so that this code can run in parallel on several shards at the same time</span> | |
| <span class="hljs-keyword">for</span> filepath <span class="hljs-keyword">in</span> filepaths: | |
| ...<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-r6j4wy">Users can also pass <code>num_proc=</code> to <code>load_dataset()</code> to set the number of worker processes.</p> <h3 class="relative group"><a id="arrowbasedbuilder" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#arrowbasedbuilder"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>ArrowBasedBuilder</span></h3> <p data-svelte-h="svelte-1s4bc00">For some datasets it can be much faster to yield batches of data rather than examples one by one. | |
| You can speed up the dataset generation by yielding Arrow tables directly, instead of examples. | |
| This is especially useful if your data comes from Pandas DataFrames for example, since the conversion from Pandas to Arrow is as simple as:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> pyarrow <span class="hljs-keyword">as</span> pa | |
| pa_table = pa.Table.from_pandas(df)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-11syk1h">To yield Arrow tables instead of single examples, make your dataset builder inherit from <a href="/docs/datasets/main/en/package_reference/builder_classes#datasets.ArrowBasedBuilder">ArrowBasedBuilder</a> instead of <a href="/docs/datasets/main/en/package_reference/builder_classes#datasets.GeneratorBasedBuilder">GeneratorBasedBuilder</a>, and use <code>_generate_tables</code> instead of <code>_generate_examples</code>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">class</span> <span class="hljs-title class_">MySuperFastDataset</span>(datasets.ArrowBasedBuilder): | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">_generate_tables</span>(<span class="hljs-params">self, filepaths</span>): | |
| idx = <span class="hljs-number">0</span> | |
| <span class="hljs-keyword">for</span> filepath <span class="hljs-keyword">in</span> filepaths: | |
| ... | |
| <span class="hljs-keyword">yield</span> idx, pa_table | |
| idx += <span class="hljs-number">1</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-15xk2c2">Don’t forget to keep your script memory efficient, in case users run it on machines with a low amount of RAM.</p> | |
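<p>As a rough illustration of the memory advice above, here is a minimal sketch that uses only the Python standard library. The helper name <code>iter_batches</code>, the <code>batch_size</code> parameter, and the JSON Lines shard layout are assumptions made for this example and are not part of 🤗 Datasets. It reads each shard one line at a time and groups rows into bounded batches, so memory use depends on the batch size rather than on the size of a shard.</p>

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_batches(
    filepaths: Iterable[str], batch_size: int = 1000
) -> Iterator[Tuple[int, List[Dict]]]:
    """Yield (idx, rows) pairs, holding at most `batch_size` rows in memory.

    Hypothetical helper: assumes every shard is a JSON Lines file
    (one JSON object per line).
    """
    idx = 0
    rows: List[Dict] = []
    for filepath in filepaths:
        # Iterating over the file object reads one line at a time,
        # so a large shard is never loaded into memory in full.
        with open(filepath, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                rows.append(json.loads(line))
                if len(rows) == batch_size:
                    yield idx, rows
                    idx += 1
                    rows = []
    if rows:
        # Flush the final, possibly smaller, batch.
        yield idx, rows
```

<p>Inside <code>_generate_tables</code>, each batch of rows could then be converted to an Arrow table before it is yielded, for example with <code>pa.Table.from_pylist(rows)</code>, assuming a PyArrow version that provides that method.</p>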